(57) An automatic re-authoring system and method re-author a document originally designed for display on
a desktop computer screen for display on a smaller display screen, such as those used with a PDA or a cellular
telephone. The automatic re-authoring system and method input a document to be re-authored and re-authoring
parameters, such as display screen size, default font and the like. The automatic re-authoring system and
method convert the document into a number of pages, where each page is fully displayable with only at most
a minimal amount of scrolling on the display screen of the PDA or cellular phone. At each stage of the
re-authoring, a number of different transformations are applied to the original document or a selected
re-authored page. The selected re-authored page is the best page resulting from the previous re-authoring
stage. The best page at each stage is determined based on the re-authoring parameters and the content of the
document being re-authored.




Printed by Jouve, 75001 PARIS (FR) 






EP 0 949 571 A2

Description 

[0001] This invention is directed to document re-authoring systems and methods that automatically [...]
[...] the world-wide web to display the documents appropriately on small screen devices, such as personal [...]

[...], IL, October 1994; G. Voelker et [...]

[...] Communicator and Samsung's Duett provide web access capabilities from [...]
[...] color monitors having at least 640x480 resolution. Many pages are designed with even [...]
Technologies already provide computational mobility and wireless connectivity, but the standard solutions to [...]

[...] both inconvenient and contradicts the rationale for having electronic documents in the first place. There are five general
approaches to displaying web documents on small screen devices: device-specific authoring; multiple-device authoring;
client-side navigation; automatic re-authoring; and web page filtering. Device-specific [...]

[...] link service, which uses a proprietary mark-up language (HDML).

In multiple-device authoring, a range of target devices is identified. Then, mappings from a single source
document [...] are defined to cover the devices within the identified range. One example
of this is the StretchText [...]

[...] Laboratory WWW Page, November 1995. In StretchText, portions of the document,
[...] down to the word level, can be tagged with a 'level of abstraction' measure. Upon receiving the document,
users can specify the level of abstraction they wish to view and are presented with the corresponding detail, or lack of [...]

Another example of multiple-device authoring is HTML cascading style sheets (CSS), as described in [...]
"Cascading Style Sheets", WWW Consortium, September 1996. In cascading style sheets, a single style sheet

[...] different structural portions of a document. For example, all
headings can be defined to be displayed in red 18-point Times font. A series of style sheets may be attached to a
document, [...] describing that style sheet's desirability to the document's author. The user can also
specify a default style sheet. The browser used by the user to access the distributed network can also define a default
style sheet. [...] style sheets normally override the user's style sheets, the user can selectively enable [...]
style sheets, providing the user with the ability to tailor the rendering of the document to the [...]

In client-side navigation, the user is given the [...]
[...] displayed at any given time. A very trivial example of this is the use
of scroll bars in the document display area. A much more sophisticated approach is that taken in the PAD++ system,
as described in B. Bederson et al., "Pad++: A Zooming Graphical Interface for Exploring Alternate Interface Physics",
Proceedings of ACM UIST '94, ACM Press, 1994, in which the user is free to zoom and pan the device display over
the document with infinite resolution. Active Outlining, as described in J. Hsu et al., "Active Outlining for HTML
Documents: An X-Mosaic Implementation", Second International World Wide Web Conference, Chicago, IL, October 1994,
has also been implemented as a client-side navigation technique, in which the user can dynamically expand and
collapse sections of the document under the respective section headings. Other techniques that fall into this category
include semi-transparent widgets, as described in T. Kamba et al., "Using small screen space more efficiently",
Proceedings, Computer-Human Interactions: CHI 96, Vancouver, BC, Canada, April 1996, and the Magic Lens system,
as described in E. Bier et al., "Toolglass and Magic Lenses: The See-through Interface", SIGGRAPH '93 Conference
Proceedings 1993.

[0008] Automatic document re-authoring involves developing software that can take an arbitrary document, such as 
an HTML document, designed to be displayed on a desktop-sized monitor, along with characteristics of the target 
display device, and re-author the arbitrary document through a series of transformations, so that the arbitrary document 
can be appropriately displayed on the target display device. This process can be performed either by the client, by the 
server, or by an intermediary proxy server, such as an HTTP proxy server, that exists solely to provide these
transformation services. An example of this latter approach is the UC Berkeley Pythia proxy server, as described in A. Fox
et al., "Reducing WWW Latency and Bandwidth Requirements by Real-Time Distillation", Fifth International World Wide
Web Conference, Paris, France, May 1996, which performs transformations on web page images. However, the focus
of the Pythia proxy server is solely on minimizing page retrieval time. Spyglass Prism is a commercial product that
performs automatic re-authoring of HTML documents using fixed transformations associated with page tags or
embedded object types. For example, Prism will reduce all JPEG images by 50%.

[0009] Finally, web page filtering lets a user see only those portions of a page that the user is interested in. Filtering
may be performed on an intermediate server, such as an HTTP proxy server, to conserve wireless bandwidth and device
memory. However, filtering could also be performed by the client device as a display-management technique. Filter
specifications can be based on keyword or regular expression matching, or on page structure navigation and extraction
commands. Filtering can be specified either using visual tools or using a scripting language.
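
As a rough illustration of the keyword and regular-expression matching mentioned above, the following Python sketch keeps only the text blocks of a page that mention a keyword or match a pattern. The function name `filter_blocks`, the representation of a page as a list of strings, and the sample data are assumptions made for this example only; they are not the scripting language of this invention.

```python
import re

def filter_blocks(blocks, keywords=(), pattern=None):
    """Keep only text blocks that contain a keyword or match a regular expression."""
    regex = re.compile(pattern) if pattern else None
    kept = []
    for block in blocks:
        lowered = block.lower()
        if any(k.lower() in lowered for k in keywords):
            kept.append(block)          # case-insensitive keyword hit
        elif regex is not None and regex.search(block):
            kept.append(block)          # regular-expression hit
    return kept

# Made-up page content, purely for illustration.
page = [
    "The stock closed higher today.",
    "Weather: sunny and mild.",
    "Contact us at 555-0100.",
]
filtered = filter_blocks(page, keywords=["stock"], pattern=r"\d{3}-\d{4}")
print(filtered)
```

Running such a filter on a proxy, before the page is sent, is what saves wireless bandwidth; running it on the client only saves display space.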

[0010] Each of the five approaches, device-specific authoring, multiple-device authoring, client-side navigation,
automatic re-authoring and web page filtering, has specific benefits and drawbacks. Device-specific authoring will typically
yield the best-looking results due to the direct involvement of a human designer. However, device-specific authoring
limits the user's access to a small, select set of documents that have been authored for that specific device.
Multiple-device authoring, while requiring less total effort per document than device-specific authoring, still requires
significantly more manual design work than simply authoring a single version of a document for a single desktop platform.
Client-side navigation will work well if a good set of viewing techniques can be developed. However, client-side navigation
requires that the entire document be delivered to the client device at once, which may waste valuable wireless
bandwidth and memory. Furthermore, the 'peephole' approach taken in PAD++ seems very awkward to use for large
documents, and the active outlining technique has limited applicability, as most web pages do not use a strict
section/subsection organization, or use that organization incorrectly.

[0011] Automatic re-authoring is thus the ideal approach to providing broad access to web documents or other web
content from a wide range of devices, if automatic re-authoring can be made to produce legible, navigable and
aesthetically pleasing re-authored documents without loss of information.

[0012] This invention provides systems and methods that automatically re-author documents designed for a larger 
display area for display on a smaller display area. 

[0013] This invention separately provides systems and methods that automatically transform a document into a
plurality of linked subdocuments, where each subdocument requires less display area.

[0014] This invention separately provides systems and methods that automatically apply a plurality of different
transforms to an original document to generate a plurality of sets of linked subdocuments.

[0015] This invention further provides systems and methods that automatically apply the plurality of different
transforms to at least one of the plurality of sets of linked subdocuments to generate additional linked subdocuments.

[0016] This invention further provides systems and methods that analyze a main subdocument of each set of linked
subdocuments to determine a best one of the main subdocuments.

[0017] This invention additionally provides systems and methods that determine if the best main subdocument can
be displayed in the smaller display area, and if not, that apply further transforms to that main subdocument to further
reduce the required display area.

[0018] This invention separately provides systems and methods that filter a document to extract a desired portion
of the document that is displayable in a smaller display area.

[0019] This invention separately provides systems and methods that filter a document to extract a desired portion
based on a predefined script.

[0020] This invention separately provides systems and methods that generate scripts usable to filter a document to 
extract a desired portion. 

[0021] This invention separately provides a scripting language usable to write scripts for filtering a document to 
[...] In one exemplary embodiment, the document re-authoring systems and methods of this invention are
implemented in an HTTP proxy that dynamically re-authors requested web pages using a heuristic planning technique and [...]

[...] document re-authoring according to the systems and methods of this invention [...]

[...] However, when the document re-authoring systems and methods of this invention [...]

[...] and ubiquitous computing, the automatic document re-authoring and document [...]

wherein: 

Fig. 1 illustrates re-authoring of a document into a section list page and a number of section pages according to
one exemplary embodiment of the document re-authoring systems and methods of this invention;
Fig. 2 illustrates a layout table that can be re-authored into a plurality of linked cells according to one exemplary
embodiment of the document re-authoring systems and methods of this invention;
Fig. 3 illustrates how a document can be re-authored into different re-authored states based on applying different
transformations according to one exemplary embodiment of the re-authoring systems and methods of this invention;
Fig. 4 illustrates one exemplary embodiment of a control form for supplying display information to the HTTP proxy
server according to the document re-authoring system and method of this invention;
[...] one exemplary embodiment of re-authoring an exemplary document according to the document [...]
[...] outlining one exemplary embodiment of a document re-authoring system [...]
[...] embodiment of the document version search space of the document re-authoring systems [...]
Figs. 11A and 11B outline one exemplary embodiment of a method for document re-authoring according to this [...]
Fig. 12 is one exemplary embodiment of a method for performing elision transformation according to this invention;
Fig. 13 is one exemplary embodiment of a method for performing table transformation according to this invention;
[...] invention;

Fig. 15 is a functional block diagram outlining one exemplary embodiment of a document re-authoring system 600
of this invention including the document filtering according to this invention;

Fig. 16 is one exemplary embodiment of the document flow during document filtering and re-authoring according
to this invention;

Fig. 17 shows an exemplary embodiment of using the document filtering systems and methods of this invention
to navigate within the abstract syntax tree generated from the image shown in Fig. 10; and
Fig. 18 illustrates further navigation within the abstract syntax tree of Fig. 10 according to the document filtering
systems and methods of this invention.

[0029] In the following discussion of the document re-authoring and document filtering systems and methods of this
invention, the terms "web page", "web document" and "document" are intended to encompass any set of information
retrieved as a single entity from a distributed network, such as an intranet, the Internet, the World Wide Web portion
of the Internet or any other known or later developed distributed network. This information can include text strings,
images, tables of text strings and images, links to other web pages and formatting information that defines the layout
of the text strings, images, tables and links within the web page.

[0030] There are many possible automatic document re-authoring techniques, which can be categorized along two
dimensions: syntactic vs. semantic techniques and transformation vs. elision techniques. Syntactic techniques operate
on the structure of the document, while semantic techniques rely on some understanding of the content. Elision
techniques basically remove some information, leaving everything else untouched, while transformation techniques involve
modifying some aspect of the document's presentation or content. Table 1 illustrates these dimensions, along with
examples of each category:



TABLE 1

Examples of different types of automatic document re-authoring techniques

             Elide                          Transform
Syntactic    Section Outlining              Image Reduction
Semantic     Removing Irrelevant Content    Text Summarizing

[0031] In order to gain an understanding of the processes required by an automated document re-authoring system,
a study was conducted to assess the characteristics of typical web pages, and to identify candidate re-authoring
techniques through the process of re-authoring several web pages by hand.

[0032] A collection of 'typical' web pages, the Xerox Corporate web site, was initially selected to focus the study. This
collection of 3,188 web pages is representative of a state-of-the-art, professionally-designed web site. A variety of
statistics were collected on these pages using a web crawler, to help gain an understanding of the structure and content
of a typical page. These statistics generally agree with other, larger-scale studies that have been performed across
the entire web.

[0033] Next, a subset of the pages in the Xerox web site was selected for manual re-authoring. A set of pages from
the Xerox 1995 Annual Report was selected and converted by hand for display on a Sharp Zaurus PDA with a 320x240
pixel screen. Detailed notes were kept of the design strategies and techniques used.

Some of the design heuristics learned during this process were:

[0034] Keeping at least some of the original images is important to maintain the look and feel of the original document. 
Common techniques include keeping only the first image, or keeping only the first and last images, i.e., the bookend 
images, and eliding the rest. 

[0035] Section headers, i.e., the H1 - H6 tags in HTML, are not often used correctly. The section headers are more
frequently used to achieve a particular font size and style, such as, for example, bold, if the section headers are used
at all. Thus, the section headers cannot be relied upon to provide a structural outline for most documents. Instead,
documents with many text blocks can be reduced by replacing each text block with the first sentence or phrase of each
block, i.e., first sentence elision.
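
A minimal sketch of first sentence elision is given here in Python. The sentence-boundary rule (a '.', '!' or '?' followed by whitespace) and the function names are assumptions of this example; the source does not specify how the first sentence or phrase is detected, and this simple rule will cut too early at abbreviations such as "e.g. ".

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def first_sentence(text):
    """Return the first sentence of a text block, or the whole block if no break is found."""
    parts = _SENTENCE_END.split(text.strip(), maxsplit=1)
    return parts[0]

def elide_to_first_sentences(blocks):
    """Replace every text block with its first sentence."""
    return [first_sentence(block) for block in blocks]

blocks = [
    "Revenue grew this year. Details follow in the tables below.",
    "A single unbroken phrase",
]
elided = elide_to_first_sentences(blocks)
print(elided)
```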

[0036] An initial rule of thumb for images is to reduce all images in size by a standard percentage, dictated by the
ratio of the display area that the document was authored for to the display area of the target device. However, images
which contain text or numbers can only be reduced by a small amount before their contents become illegible.

[0037] Semantic elision can be performed on sidebars that present information which is tangential to the main
concepts presented in a page. Many of the Xerox pages had such sidebars, which were simply eliminated in the reduced
[...] Semantic elision can also be performed on images that do not contribute any information to the page, but [...]

[...] tables. Banners primarily contain a set of images and a small number of navigation links, often [...]
[...] to establish an aesthetic look, but contain little or no content. When space is at a premium, these can usually [...]
[...] table pages are primarily sets of hypertext links to other pages and [...]
additional content. These link table pages can usually be re-formatted into a more compact form that just lists the links [...]

[...] whitespace, which is taken for granted on a large [...]

[...] discovered for reducing the amount of whitespace in a page. Sequences of paragraphs, i.e., HTML [...]
breaks, i.e., HTML "BR" tags, can be collapsed into one such paragraph or break. Lists, i.e., HTML "UL", "OL" and/or
[...] tags, take up valuable horizontal space with their indenting and bullets. These lists can be re-formatted into simple
text blocks with breaks between successive items, as described in Cooper et al. [...]

[...] In conclusion, to perform document re-authoring, two things are required: a set of re-authoring techniques,
[...] a set [...], and a strategy for applying the page transformations. [...]
manual re-authoring study, those most amenable to codification were the syntactic elision techniques, including section
outlining, first sentence elision, and image elision, and the syntactic transformation techniques, including image size
reduction and font size reduction. The design strategy learned during the study included a ranking of the transformation

[...], and a set of conditions under which each transformation or combination of [...]

[...] and methods of this invention: a collection of individual re-authoring techniques that transform documents in
[...] document re-authoring systems and methods that implement a design strategy by
selecting the best combination of techniques for a given document/display size pair.

[...] Header Outlining transform provides a very good method for reducing the required display size
for structured documents such as technical papers and reports. The outlining process is shown in Fig. 1.
As shown in Fig. 1, the document 100 is converted into a list of sections page 110 and each section [...] elided
[...] That is, the contents 106 of each section 102 of the document 100 is elided from the document 100
and [...] 104 is converted into a hypertext link. When the hypertext link for any section is [...]

[...] section [...] is determined and all content below that level, including lower-level section headers, is elided, but [...]

[...] blocks, even when no section headers are present, the First Sentence Elision
transform [...] way of reducing required screen area. In this technique, each text block is replaced with its
first sentence or, alternatively, its first phrase up to some natural break point. This first sentence or phrase is also [...]

[...] attempts to find page elements that can logically be [...]

[...] ordered or bordered lists, sequences of paragraphs or tables. This transform takes an input page, segments the
content into sub-pages by allocating some number of items to each, and builds and prepends an index page to the
[...] The Indexed Segment transform then starts filling output pages with these elements in order
until each [...] client's display size. If a single logical element cannot fit on a single output page,
then the Indexed Segment transform performs a secondary partitioning that partitions text blocks on paragraph or [...]

[...] In the Indexed Segment transform, as much style information as possible is retained for the output elements
by outputting each element embedded within all of its ancestor partitions' HTML tags. The Indexed Segment transform
[...] index page by copying a section header or first sentence from each element to be output,
concatenating [...] into an index page, and creating a hypertext link from each copied portion [...]
sub-page. It should be appreciated that the index page itself may also need to be segmented. In the Indexed Segment
transform, 'Next' and 'Previous' navigation links between sequential sub-pages are also added for navigational [...]

[...] The Table transform recognizes when a table, i.e., the presentation of information arranged in a rectangular
grid on a page, cannot be directly sent to the client. In these cases, the Table transform generates one sub-page per
cell, in [...] top-down, left-to-right order. Tables nested within tables are processed in the same manner. The
[...] uses heuristics to determine when table columns are being used as "navigational sidebars", which is
a common practice in commercial HTML web pages. In this case, the Table transform moves these cells to the end of
the list of sub-pages, as these cells tend to carry very little content.

[0049] Fig. 2 shows a nested table, marking tables with thicker borders than table cells. In the table 120 shown in Fig.
2, the cell 122 is identified as a sidebar and will be placed after the cell 128. All of the other cells are placed in their
natural order. The six portions of the cell 124, such as the subcells 125 and 126, are each placed in their own
sub-page between the subpages containing the subcells 123 and 127, unless they contain only whitespace.

[0050] As one can see from the example, nested tables and sidebars complicate the processing of tables. This is
especially true if the sidebar is part of an inner table. In that situation, the sidebar should be moved to the end of the
inner table, rather than to the end of any surrounding tables. In one exemplary embodiment of the document
re-authoring systems and methods of this invention, the sidebars are moved one table at a time and then all table cells are
processed at once, rather than grouping the cells by table.
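
The cell ordering described for the Table transform can be sketched as follows in Python. The `Cell` and `Table` classes, the `is_sidebar` flag (assumed to have been set by some earlier sidebar heuristic, which is not shown) and the sample table are hypothetical names introduced for this illustration; the sketch only demonstrates top-down, left-to-right ordering, moving sidebars to the end of their own table, and skipping whitespace-only cells.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Cell:
    content: Union[str, "Table"]   # plain text or a nested table
    is_sidebar: bool = False       # assumed to be set by a separate heuristic

@dataclass
class Table:
    rows: List[List[Cell]] = field(default_factory=list)

def table_to_subpages(table):
    """Return one sub-page per non-empty cell, sidebars last within their own table."""
    cells = [cell for row in table.rows for cell in row]   # top-down, left-to-right
    ordered = ([c for c in cells if not c.is_sidebar] +
               [c for c in cells if c.is_sidebar])         # sidebars to the end of THIS table
    pages = []
    for cell in ordered:
        if isinstance(cell.content, Table):
            pages.extend(table_to_subpages(cell.content))  # nested tables handled the same way
        elif cell.content.strip():                         # skip whitespace-only cells
            pages.append(cell.content)
    return pages

inner = Table([[Cell("a"), Cell("   ")], [Cell("b")]])
outer = Table([
    [Cell("nav links", is_sidebar=True), Cell("intro")],
    [Cell(inner), Cell("footer")],
])
subpages = table_to_subpages(outer)
print(subpages)
```

Because the reordering is done per table before recursion, a sidebar inside an inner table stays within that inner table's run of sub-pages rather than migrating to the end of the outer table.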

[0051] Images present one of the most difficult problems for automatic document re-authoring, because the decision
of whether to keep, reduce, or eliminate a given image should be based on an understanding of the content and role
of the image on the page. However, Image Reduction transforms and Image Elision transforms can be applied without
content understanding, as long as users are provided a mechanism by which the users can retrieve the original images.
In one exemplary embodiment of the systems and methods of this invention, the Image Reduction transform reduces
all images in a page by one of a set of pre-defined scaling factors, such as 25%, 50%, and 75%, and makes the
reduced images into hypertext links that link the reduced images back to the original images.

[0052] In addition to the Image Reduction transform, three Syntactic Elision transforms have also been developed
for images: the Elide All transform, the First Image Only transform, and the Bookends transform. In the Elide All transform,
all images are elided from the document. In the First Image Only transform, all but the first image are elided from the
document. In the Bookends transform, all but the first and last images are elided from the document. The elided images
are each replaced with their HTML "ALT" text when it is available. Alternatively, the elided images are each replaced
with a standard icon when no ALT text is available. The ALT text or standard icon for each elided image is also made
into a hypertext link to that original image.
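
The three image elision policies can be illustrated with a short Python sketch. The `Image` class, the policy names, the `ICON` placeholder string and the emitted markup are assumptions made for this example rather than the implementation of this invention; the logic follows the description above: kept images are emitted as images, and elided images become hypertext links labelled with their ALT text or, failing that, a standard icon.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Image:
    src: str
    alt: Optional[str] = None

ICON = "[image]"  # stand-in for the "standard icon"

def keep_indices(count, policy):
    """Indices of the images that survive under each elision policy."""
    if policy == "elide_all":
        return set()
    if policy == "first_only":
        return {0} if count else set()
    if policy == "bookends":
        return {0, count - 1} if count else set()
    raise ValueError(f"unknown policy: {policy}")

def elide_images(images, policy):
    """Render each image either as itself or as a link back to the original."""
    keep = keep_indices(len(images), policy)
    rendered = []
    for index, image in enumerate(images):
        if index in keep:
            rendered.append(f'<img src="{image.src}">')
        else:
            label = image.alt or ICON
            rendered.append(f'<a href="{image.src}">{label}</a>')
    return rendered

images = [Image("a.gif", "Logo"), Image("b.gif"), Image("c.gif", "Chart")]
bookends = elide_images(images, "bookends")
first_only = elide_images(images, "first_only")
print(bookends)
print(first_only)
```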

[0053] In one exemplary embodiment of the document re-authoring systems and methods of this invention, if screen
space is too limited or the client device cannot display images, the images are removed from the document. However,
the removed images may be used as anchors for hypertext links via a client-side image map. It should also be
appreciated that if such images are removed, the web site represented by the HTML document can be rendered
non-navigable. To accommodate this, in one exemplary embodiment of the document re-authoring systems and methods of
this invention, a transform that extracts the hypertext links from such images and formats them into a text list of link
anchors is used. The labels for the text list are extracted from the HTML "ALT" tags of the image map, if present, or
from part of the Uniform Resource Locator of the link. This transformation preserves links attached to images for
navigation when removing the images.

[0054] The overall process of deciding which combination of transforms to apply to a given page for a given client
display seems at first to require some form of human artistic ability. However, the automatic document re-authoring
systems and methods of this invention capture many of the heuristics used in the manual re-authoring exercise, and
do a fairly good job of producing good-looking pages for a given display.

[0055] Individual page transformations are ordered by their desirability. In order to determine which combination of
transformations should be applied to a given document, the document re-authoring systems and methods of this
invention perform a depth-first search of the document transformation space, using many heuristics that describe
pre-conditions for transformations and combinations of transformations. The depth-first search ensures that a "good
enough" version of the document is found by using a combination of the most desirable transformations. Only if the
more desirable transformations are not applicable or do not reduce the document enough are the less favored
transformations used.

[0056] The document re-authoring systems and methods of this invention search a document transformation space
in a best-first manner. Each state in this search space represents a version of the document, with the initial state
representing the original 'as-authored' document. Each state is tagged with a number representing a measure of merit
that represents the quality of the document version at that state. The measure of merit, i.e., the evaluation function or
value, for each state is a rough estimate of the screen area required to display the entire document as that document
exists in that state. A state can be expanded into a successor state by applying a single transformation technique to
the re-authored document as it exists in that state.

[0057] At every step in the search process, the most-promising state of the document, i.e., the state with the smallest
current display area requirements, is selected and a transformation is applied to transform the document from its current
state to a more promising state of the document, if possible. As soon as a state is created that contains a document
version that is 'good enough', the search can be halted and that version of the document is returned to the client device
for rendering. Alternatively, the search is continued until all content of the original page is contained or represented in
a set of good-enough subpages. If the search is exhausted and no document version can be found that is good enough,
then the best document found during the search is returned to the client device for rendering. If there are hard size
constraints that are not met by the best document, a more destructive transformation is applied that breaks documents
up in the middle of paragraphs.
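
The best-first search described in paragraphs [0056] and [0057] can be sketched in Python with a priority queue keyed on the estimated display area. Everything concrete here is an assumption for illustration: documents are modelled as hashable `(text_area, image_area)` tuples, the two toy transforms and their thresholds are invented, and the `max_states` cap is a safety limit that the source does not mention.

```python
import heapq
from itertools import count

def best_first_reauthor(initial, transforms, area, good_enough, max_states=1000):
    """Expand the smallest-area state first; stop at the first good-enough version.

    transforms: functions mapping a document to a new document, or None if not applicable.
    Returns the good-enough document, or the smallest one found if the search is exhausted.
    """
    tie = count()                                   # tie-breaker so documents are never compared
    frontier = [(area(initial), next(tie), initial)]
    seen = {initial}
    best = initial
    expanded = 0
    while frontier and expanded < max_states:
        doc_area, _, doc = heapq.heappop(frontier)  # most promising state
        if doc_area < area(best):
            best = doc
        if good_enough(doc):
            return doc
        expanded += 1
        for transform in transforms:                # one transformation per successor state
            successor = transform(doc)
            if successor is not None and successor not in seen:
                seen.add(successor)
                heapq.heappush(frontier, (area(successor), next(tie), successor))
    return best

# Toy model: a document is (text_area, image_area) in square pixels.
def shrink_images(doc):
    text, images = doc
    return (text, images // 4) if images > 0 else None      # 50% per side = 1/4 of the area

def elide_text(doc):
    text, images = doc
    return (text // 5, images) if text > 1000 else None     # invented reduction ratio

def area(doc):
    return doc[0] + doc[1]

SCREEN_AREA = 320 * 240

def good_enough(doc):
    return area(doc) <= 2.5 * SCREEN_AREA

initial = (300000, 400000)
transforms = [shrink_images, elide_text]
result = best_first_reauthor(initial, transforms, area, good_enough)
print(result)
```

In this toy model the search stops as soon as a popped state satisfies `good_enough`, so it does not keep shrinking the document past what the display needs; with a predicate that never succeeds, the smallest version found is returned instead.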

[0058] Fig. 3 shows how different transformations applied to a document 200 result in different resulting re-authored
sub-pages 210, 220 and 230. Depending on the information supplied by the user to the systems and methods of this
invention, one of the sub-pages 210, 220 and 230 would be selected as the "best" re-authored page. Then, if further
re-authoring is required, for example, to generate good-enough subpages for the content removed from the first
sub-page or if the best sub-page is not yet "good enough", additional transformations could be applied to the subpages
resulting from the selected best re-authored sub-page 210, 220 or 230 or to further re-author the selected best
re-authored subpage 210, 220 or 230.

[0059] Heuristic information is used in several places by the document re-authoring systems and methods according
to this invention, including: the order in which various transformation techniques are applied to a given state; the
pre-conditions for each transformation technique; and the determination of when a document version or subpage is 'good
enough'. In general, transformations which make minor changes to the document are preferred over those which make
more extensive changes. For example, reducing images by 25% is preferable to reducing the images by 75%.

[0060] The pre-conditions for each transformation technique specify the other transformations with which that
transformation can be combined. For example, it makes no sense to apply both full outlining and first sentence elision to
the same document. The preconditions also specify the requirements on the content and structure of the document
that the technique is being applied to. For example, the Full Outlining transform should be applied only when there are
at least three section headers in the document or sub-page being re-authored. The current condition for 'good enough'
is fairly simplistic. That is, the search is stopped when the area required by a document or sub-page is no more than a
predetermined multiple of the screen area of the client display. In general, this predetermined multiple is greater than 1,
and, in one exemplary embodiment, is 2.5. This higher multiple merely assumes that the user doesn't mind scrolling the
display a little in one direction.
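
The two kinds of heuristic checks just described can be written down as small predicates. This Python sketch is only an illustration: the function names, the use of a set of transform-name strings to track what has already been applied, and the reading of the stopping rule as "required area is at most 2.5 times the screen area" are assumptions of this example.

```python
def is_good_enough(required_area, screen_width, screen_height, multiple=2.5):
    """True when the page needs no more than `multiple` screens' worth of area."""
    return required_area <= multiple * screen_width * screen_height

def can_apply_full_outlining(section_header_count, already_applied):
    """Pre-condition sketch: enough headers, and not combined with first sentence elision."""
    return (section_header_count >= 3
            and "first_sentence_elision" not in already_applied)

# A 320x240 display tolerates up to 2.5 * 76,800 = 192,000 square pixels.
print(is_good_enough(192000, 320, 240))
print(can_apply_full_outlining(4, {"image_reduction_25"}))
```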

[0061] When a transformation is applied to a document it can result in the document's contents being split into
multiple smaller "sub-pages", as shown in Fig. 2. However, each of these sub-pages may still be too large to download
and display on the client. To address this problem, the document re-authoring systems and methods of this invention
keep a list of the sub-pages generated by each sequence of transformations attached to the state representing the
resulting document version. Once the good-enough version of the document is selected, which is really only a
good-enough version of the first sub-page delivered to the client, the list of generated sub-pages for that version is added
to a global list of pages to be re-authored. The document re-authoring systems and methods of this invention then
re-author each of these to-be-re-authored pages until all of the resulting sub-pages can be delivered to the client. This
procedure is shown in pseudocode below, where "reauthor" refers to the best-first re-authoring process described
above for a single input page.



Digestor(initial_page)
    to_be_reauthored = { initial_page }
    to_deliver = {}
    while (to_be_reauthored != {})
        next_page = pop(to_be_reauthored)
        best_version_state = reauthor(next_page)
        to_deliver.append(best_version_state.page)
        to_be_reauthored.append(best_version_state.sub_pages)
    return to_deliver
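
The Digestor pseudocode can be made runnable along the following lines. This is a minimal sketch: VersionState and toy_reauthor are hypothetical stand-ins, the real "reauthor" is the best-first process described above, and first-in-first-out processing of the work list is an assumption because the pseudocode does not fix the order of pop.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VersionState:
    """Hypothetical result of re-authoring one page."""
    page: str                                            # the good-enough page
    sub_pages: List[str] = field(default_factory=list)   # content split off


def digestor(initial_page: str,
             reauthor: Callable[[str], VersionState]) -> List[str]:
    to_be_reauthored = deque([initial_page])
    to_deliver: List[str] = []
    while to_be_reauthored:
        next_page = to_be_reauthored.popleft()
        best_version_state = reauthor(next_page)
        to_deliver.append(best_version_state.page)
        # Every generated sub-page must itself be re-authored.
        to_be_reauthored.extend(best_version_state.sub_pages)
    return to_deliver


def toy_reauthor(page: str) -> VersionState:
    """Stand-in for the real process: keep 10 characters, split off the rest."""
    if len(page) <= 10:
        return VersionState(page)
    return VersionState(page[:10], [page[10:]])
```

With the toy stand-in, a 25-character page is delivered as three pages of 10, 10 and 5 characters.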

[0062] All re-authored sub-pages are cached as transformed parse trees. As the user navigates a transformed
document and requests sub-pages, the corresponding parse trees are rendered and sent to the client.

[0063] The document re-authoring systems and methods of this invention re-author a document by first parsing the
document and constructing a parse tree or abstract syntax tree (AST) representation of the document. The document
re-authoring systems and methods of this invention then apply a series of transformations to the parse tree. Then, the
document re-authoring systems and methods of this invention map each resulting transformed parse tree back into a
document representation, which may be in a document format that is different from the input format of the original
document.
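
As a rough illustration of the parse, transform and re-render cycle, the following sketch builds a small tree from markup using Python's standard html.parser module and maps it back into text. Node, TreeBuilder, parse and render are hypothetical names. The sketch assumes well-formed, properly nested markup without void elements such as a bare img tag, and it discards attributes, so it is far simpler than a real HTML parser.

```python
import itertools
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List


@dataclass
class Node:
    node_id: int                 # unique identifier for the node
    tag: str                     # element name, or "#text" for text content
    text: str = ""
    children: List["Node"] = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a Node tree under a synthetic "page" root."""

    def __init__(self) -> None:
        super().__init__()
        self._ids = itertools.count(0)
        self.root = Node(next(self._ids), "page")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(next(self._ids), tag)
        self._stack[-1].children.append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            self._stack[-1].children.append(
                Node(next(self._ids), "#text", data.strip()))


def parse(markup: str) -> Node:
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def render(node: Node) -> str:
    """Map a tree back into a surface form (here, the same markup)."""
    if node.tag == "#text":
        return node.text
    inner = "".join(render(child) for child in node.children)
    return inner if node.tag == "page" else f"<{node.tag}>{inner}</{node.tag}>"
```

A transformation would operate on the Node tree between parse and render; a different render function could target an output format other than HTML.
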
[0064] Document transforms are implemented using a standard procedure that includes a condition function that
takes a state node in the document version space and returns true if the transform should be applied to the state, and
an action function that is called when the transform is actually applied to a state to produce a new state containing a
new document version, a new measure of quality and the resulting sub-pages. Three types of transforms can be
defined: 1) those which are always run on a page before the planning process starts; 2) those used in the best-first
planning process; and 3) those which are always run on a page before it is translated from the final abstract syntax
tree back into a surface form such as HTML.

[0065] Transformations manipulate the parse tree, in the state they are applied to, in order to produce a new version
of the document. The manipulations are similar to those described in S. Bonhomme et al., "Interactively Restructuring
HTML Documents", Fifth International World Wide Web Conference, Paris, France, May 1996. Whenever portions
of the parse tree are elided or transformed, an HTML hypertext link is added into the parse tree to reference the node
identifiers of all affected parse tree subtrees, enabling users to request the original portions of the document that
have been modified during re-authoring.

[0066] The document re-authoring systems and methods of this invention also keep track of which combinations of 
transforms have already been tried, via a global list of transform sets, assuming that all transformations are commu- 
tative, to ensure that no duplicate states are ever constructed. 
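
Because the transformations are assumed to be commutative, a set of transform names identifies a state regardless of the order of application. A minimal sketch of the global list of transform sets, using the hypothetical name TriedTransformSets, is:

```python
from typing import Iterable, Set


class TriedTransformSets:
    """Records which combinations of transforms have already been tried."""

    def __init__(self) -> None:
        self._seen: Set[frozenset] = set()

    def try_add(self, applied: Iterable[str], new_transform: str) -> bool:
        """Return True if the combination is new, recording it as tried."""
        key = frozenset(applied) | {new_transform}
        if key in self._seen:
            return False      # an equivalent state already exists
        self._seen.add(key)
        return True
```

Under this sketch, applying B after A and applying A after B map to the same key, so only the first of the two is constructed.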

[0067] One exemplary document re-authoring system and method according to this invention, as described above,
has been implemented as an HTTP proxy server. The HTTP proxy server accepts a request for an HTML document,
retrieves the document from the specified HTTP server, parses the HTML document, constructs the parse tree, or
abstract syntax tree, from the retrieved HTML document, labels each of the parse tree nodes with a unique identifier,
and then retrieves any embedded images so that the size of the retrieved images can be determined, as necessary.
Once this has been accomplished, the document re-authoring systems and methods of this invention are initialized
with a state containing the parse tree for the original retrieved document. During each re-authoring cycle, the document
re-authoring systems and methods of this invention select the state with the best document version so far, then select
the best applicable transformation technique and apply the selected transformation, resulting in a new state and a new
document version being generated. It is assumed that the convolution of transformations is always commutative, and
several checks are used by the re-authoring software systems and methods of this invention to ensure that redundant
states are not constructed.
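
The re-authoring cycle can be sketched as a best-first search over states. The following is an illustration under strong simplifying assumptions: a document version is reduced to a single area estimate, each transform is a hypothetical function from area to area, "best state" means smallest area, and "best applicable transformation" means the first one in a fixed preference order that yields an untried combination.

```python
import heapq
from typing import Callable, FrozenSet, List, Optional, Tuple

AreaTransform = Tuple[str, Callable[[float], float]]


def best_first(initial_area: float,
               transforms: List[AreaTransform],
               screen_area: float,
               multiple: float = 2.5
               ) -> Optional[Tuple[float, FrozenSet[str]]]:
    """Return (area, applied transform names) of a good-enough state, or None."""
    limit = multiple * screen_area
    seen = {frozenset()}                       # combinations already built
    heap = [(initial_area, 0, frozenset())]    # (area, tie-breaker, applied)
    counter = 1
    while heap:
        area, _, applied = heap[0]             # best document version so far
        if area <= limit:
            return area, applied
        for name, fn in transforms:            # preference order
            key = applied | {name}
            if name not in applied and key not in seen:
                seen.add(key)
                heapq.heappush(heap, (fn(area), counter, key))
                counter += 1
                break
        else:
            heapq.heappop(heap)                # nothing left to try here
    return None
```

The search ends either with a state whose area is within the limit or, when every combination has been exhausted, with no result; a practical system would then fall back to the best state found.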

[0068] In one exemplary embodiment of the document re-authoring systems and methods of this invention, fifteen
transformation techniques were implemented: FullOutline, OutlineToH1, OutlineToH2, OutlineToH3, OutlineToH4,
OutlineToH5, OutlineToH6, FirstSentenceElision, ReduceImages25%, ReduceImages50%, ReduceImages75%,
ElideAllImages, FirstImageOnly, BookendImages, and ReduceFontSize.

[0069] This exemplary embodiment of the document re-authoring systems and methods of this invention has been
implemented in the Java programming language. In addition to functioning as a true proxy server, this HTTP proxy
server system can also respond to requests for certain uniform resource locators with documents generated by the
HTTP proxy server itself. This is used to provide the user with forms-based control over the HTTP proxy server and
the document re-authoring systems and methods. This exemplary embodiment of the document re-authoring system
can process even very complex pages in less than 2 seconds on a 200 MHz Pentium, using Symantec's Java JIT
compiler.

[0070] The first thing that a user of the document re-authoring software systems and methods of this invention must
do is specify the size of the display for the device being used and indicate the font size of the default browser font
being used. This information is needed in order to estimate the screen area requirements of text blocks. To do this,
the user requests a specific control uniform resource locator from the HTTP proxy server, resulting in delivery of
the form 300 shown in Fig. 4.

[0071] Once a user has configured the document re-authoring system, the user can start retrieving documents from
a distributed network, such as the World Wide Web. The original page 400 and the re-authored page 410 shown in
Fig. 5 illustrate the re-authoring capability of the document re-authoring systems and methods of this invention. In this
example, this exemplary embodiment of the document re-authoring systems and methods of this invention chose to
use 25% image reduction in combination with first sentence elision to render the displayed page 410 from the original
page 400. The re-authored page 410 is then displayed in a browser window 420. In this exemplary embodiment of
the re-authoring systems and methods of this invention, immediately following retrieval of a page, the user can request
a trace of the re-authoring session to determine which transformations had been applied, by requesting another control
uniform resource locator from the HTTP proxy server.

[0072] Fig. 6 shows one exemplary embodiment of an environment 500 in which the automatic document re-authoring
systems and methods and/or the automatic document filtering systems and methods of this invention will be
implemented. As shown in Fig. 6, the environment 500 includes a limited display area device 510 that has a display
having a display area that is significantly limited relative to the display area of a monitor for a desktop or a laptop
computer. As shown in Fig. 6, the environment 500 further includes a transmitter/receiver communication system 550,
a host node 570 of a distributed network and the remaining portions 590 of the distributed network.

[0073] In the environment 500, the limited display area device 510 will normally be a personal digital assistant
(PDA), a cellular phone or the like that is connected by a wireless communication channel 530 to the transmitter/
receiver communication system 550. Thus, as shown in Fig. 6, the limited display area device 510 will normally include
an antenna 520, while the transmitter/receiver communication system 550 will normally include a corresponding
antenna 540. The limited display area device 510 will normally communicate with the transmitter/receiver communication
system 550 over the wireless communication channel 530 using radio frequency signals transmitted between the
antennas 520 and 540.
[0074] The transmitter/receiver communication system 550 converts the analog or digital signals received from the
limited display area device 510 over the communication channel 530 into a form usable by the host node 570 of the
distributed network. The transmitter/receiver communication system 550 then outputs the signals received over the
communication channel 530 over a communication link 560 to the host node 570 of the distributed network. It should
be appreciated that the communication link 560 can be any known or later-developed communication structure capable
of transmitting the appropriate signals between the transmitter/receiver communication system 550 and the host node
570 of the distributed network. Because the exact structure of the transmitter/receiver communication system 550 and
the communication link 560 will be a matter of design choice depending upon how these elements are implemented,
and because such design choices will be readily apparent and predictable to those of ordinary skill in the art, these
elements will not be further described.

[0075] It should also be appreciated that the limited display area device 510 can also be connected to the host node
570 of the distributed network by other than the wireless communication channel 530, such as a communication link
522. That is, the communication link 522 could be any other known communications structure, such as a local area
network, a wide area network, a modem connection over the public switched telephone network or a cable television
system or the like. For example, the user of the limited display area device 510, rather than communicating over the
wireless communication channel 530, could connect the limited display area device 510 to the public switched telephone
network using a modem. The user would then dial directly into the host node 570 of the distributed network.
[0076] Regardless of how the host node 570 of the distributed network is ultimately connected to the limited display
area device 510, once the host node 570 of the distributed network receives a request for a document to be transmitted
to the limited display area device 510, the host node 570 of the distributed network first determines if the requested
document is located locally on the host node 570 of the distributed network. If the requested document is not located
locally, the host node 570 of the distributed network communicates over a communication structure 580 to the remaining
portions 590 of the distributed network to request the document. The particular node of the remaining portions 590 of
the distributed network storing that document ultimately will receive the request from the host node 570 over the
communication structure 580 and will return the requested document to the host node 570 over the communication
structure 580. It should be appreciated that the communication structure 580 can be any known or later-developed
communication structure and protocol system for linking together widely located nodes of a distributed network.
[0077] Once the host node 570 of the distributed network receives the requested document, an HTTP proxy server
executing on the host node 570 of the distributed network re-authors the requested document based on the
previously-provided information about the limited display area device 510. A first re-authored page is then transmitted
by the host node 570 over either the wireless communication channel 530 or the communication link 522 to the limited
display area device 510. As the user reviews the delivered page, the user may determine that viewing additional
information removed from the re-authored page is required. In this case, the user will send a request over one of the
wireless communication channel 530 or the communication link 522 to the host node 570 of the distributed network to
obtain the desired re-authored sub-page. The host node 570, in response to this request, transmits a further re-authored
sub-page of the original document to the limited display area device 510 over one of the wireless communication
channel 530 or the communication link 522.

[0078] Fig. 7 shows this information flow in greater detail. As shown in Fig. 7, when the user of the limited display
area device 510 wishes to review a particular document residing on a distributed network, the user sends a request
for the particular document from the limited display area device 510 to an HTTP proxy server 571 residing on the host
node 570 of the distributed network. The HTTP proxy server 571 then transmits the request for the particular document
to the particular remote node 591 on the distributed network that stores the requested page. The particular remote
node 591 returns the requested original document to a document re-authoring system 600 residing on the HTTP proxy
server 571. The document re-authoring system 600 re-authors the original document into a plurality of sub-documents
that are each capable, as closely as possible, of being displayed on the limited display area device 510. The document
re-authoring system 600 then delivers the first re-authored page to the limited display area device 510, while the
other re-authored sub-pages are stored in a re-authored sub-page cache 636 of the document re-authoring system
600. Thus, when the user of the limited display area device 510 wishes to view information residing on one of the
re-authored sub-pages stored in the re-authored sub-page cache 636, the user causes the limited display area device
510 to transmit a request for that sub-page. The requested cached sub-pages are delivered from the re-authored
sub-page cache 636 to the limited display area device 510.
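
The role of the re-authored sub-page cache can be sketched as a mapping from sub-page identifiers to cached parse trees that are rendered only when requested. SubPageCache, its methods and the identifier format are hypothetical; the parse tree is reduced to a list of words for brevity.

```python
from typing import Callable, Dict, List


class SubPageCache:
    """Holds re-authored sub-pages as trees; renders them on request."""

    def __init__(self, render: Callable[[List[str]], str]) -> None:
        self._trees: Dict[str, List[str]] = {}
        self._render = render

    def store(self, page_id: str, tree: List[str]) -> None:
        self._trees[page_id] = tree

    def request(self, page_id: str) -> str:
        # Rendering is deferred until the client actually asks for the page.
        return self._render(self._trees[page_id])

    def __contains__(self, page_id: str) -> bool:
        return page_id in self._trees
```

A request for an identifier that was never stored raises KeyError in this sketch; a proxy would instead answer with an error page.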

[0079] It should be appreciated that, while the HTTP proxy server 571, the document re-authoring system 600 and the
re-authored sub-page cache 636 are shown in Fig. 7 as independent elements, in general, these elements will be
implemented as different portions of a single entity, such as different modules of a single software application.
[0080] Fig. 8 is a functional block diagram outlining in greater detail one exemplary embodiment of the document
re-authoring system 600. As shown in Fig. 8, the document re-authoring system 600 includes a controller 610, an input/
output interface 620, a memory 630, an abstract syntax tree generating circuit 640, a document size evaluation circuit
650, a transform circuit 660 and a tree-to-document remap circuit 670, each interconnected by a data/control bus 680.
The communication links 522, 560 and 580 discussed above with respect to Fig. 6 are each connected to the input/
output interface 620.

[0081] The memory 630 includes a number of functionally distinct portions, including an original page memory portion
631, a display device size memory portion 632, an abstract syntax tree memory portion 633, a search space portion
634, a transform memory 635, the re-authored page cache 636 described above with respect to Fig. 7, and a
sub-pages to be re-authored list 637. The original page memory portion 631 stores the original document returned
from the remote node 591 of the distributed network that stores the page requested by the limited display area device
510.

[0082] The display device size memory 632 stores a number of form documents used by the document re-authoring
system 600 to obtain the various parameters about the limited display area device 510 that the document re-authoring
system 600 uses to re-author a page for a particular limited display area device 510. The display device size memory
632 also stores the particular size parameters for at least one limited display area device 510. It should be appreciated
that there are a number of different possible ways of implementing the document re-authoring system 600 relative to
the various parameters about the limited display area device 510. In one exemplary embodiment, the document
re-authoring system 600 can store the various parameters for a particular limited display area device 510 only for as
long as that limited display area device 510 remains continuously connected to the document re-authoring system 600.
In this case, each time a particular limited display area device 510 is reconnected to the document re-authoring system
600, the document re-authoring system 600 would send the various forms used to obtain the various parameters about
the limited display area device 510 and the user would be required to re-supply these various parameters each time
the document re-authoring system 600 was initially accessed.

[0083] While this reduces the required size for the display device size memory 632 and does not require any system
for identifying a particular limited display area device 510, this approach places a larger burden on the user of the
limited display area device 510 or requires a process for automating the supply of information from the limited display
area device 510 to the document re-authoring system 600. This automation could be provided, for example, by the
document re-authoring system 600 requesting the information from the limited display area device 510. If the
information has already been entered by the user during a previous session with the document re-authoring system 600,
and that information was stored at that time on the limited display area device 510, the user would not need to be
actively involved in re-supplying the information to the document re-authoring system 600.
[0084] Alternatively, the information could be stored in the display device size memory 632, along with an identification
code that the user can cause to be supplied from the limited display area device 510 when beginning a session with
the document re-authoring system 600. By supplying the identification code to the document re-authoring system 600,
the user again would not be required to re-supply all of the various parameters about the limited display area device
510 each time the document re-authoring system 600 is accessed.
[0085] In any case, the document re-authoring system 600 uses the various parameters about the limited display area
device 510, as described above, when re-authoring the original page stored in the original page memory 631 so that
each re-authored page will fit, as closely as possible, onto the small display area of the limited display area device 510.
[0086] The abstract syntax tree memory portion 633 stores the abstract syntax tree generated from the original
document stored in the original page memory 631 by the abstract syntax tree generating circuit 640. The transform
memory portion 635 stores the various transforms described above, as well as the conditions under which each
transform can be applied and the conditions regarding which transforms are not usable with various other ones of the
transforms. The transform memory 635 also stores an indication of the desirability of applying any particular transform
to a particular original or re-authored page. That is, as described above, the various transforms have a general order
that emphasizes applying a more limited transform, such as reducing an image by a small amount, over a more radical
transform, such as reducing an image by a large amount or removing the image completely.

[0087] The re-authored page cache 636 stores the abstract syntax tree corresponding to each re-authored page or
sub-page as the document size evaluation circuit 650 indicates that the abstract syntax tree for a particular re-authored
page or sub-page is good enough, based on the various parameters about the limited display area device 510 stored
in the display device size memory 632. The sub-pages to be re-authored list 637 stores the abstract syntax trees for
those sub-pages generated by transforming the original document or an earlier sub-page. These sub-pages will
generally contain the original versions of any reduced-size images or any elided images, as well as the full text of any
text segments that have had content elided from them.

[0088] Finally, the search space memory 634 stores a number of states generated by the transform circuit 660 as it
applies the various transforms stored in the transform memory 635 to either the original document stored in the original
page memory 631 or to various sub-pages stored in the sub-pages to be re-authored list 637, based on the particular
state of the search space currently being manipulated.
[0089] In particular, each state i in the search space 634 includes an evaluation value portion, a transformed abstract
syntax tree portion and a sub-page list portion. The evaluation value portion stores the evaluation value generated
by the document size evaluation circuit 650 for the re-authored page or sub-page corresponding to the state i.
The transformed abstract syntax tree portion stores the transformed abstract syntax tree for the state i generated
by the transform circuit 660 by applying one of the transforms in the transform memory 635 to the parent state of the
state i. The sub-page list portion stores the list of sub-pages generated to store any original content removed from the
page corresponding to the state i when the transform circuit 660 applies the particular transform used to generate that
state i.

[0090] It should be appreciated that state 0 corresponds to the original document stored in the original page memory
631. In particular, the evaluation value portion of state 0 corresponds to the evaluation value generated for the original
document before any re-authoring. In this state 0, the transformed abstract syntax tree portion stores the original
untransformed abstract syntax tree generated by the abstract syntax tree generating circuit 640 for the original document.
Finally, for state 0, the sub-page list will be empty, as the original document contains all of the original information
and, therefore, no sub-pages are required.
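
The three portions of a search-space state, and the special case of state 0, might be modeled as follows. SearchState, make_state_0 and make_child are hypothetical names, the evaluation function is supplied by the caller, and the explicit parent reference is an addition for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class SearchState:
    evaluation: float                   # evaluation value portion
    tree: Any                           # transformed abstract syntax tree portion
    sub_pages: List[Any] = field(default_factory=list)   # sub-page list portion
    parent: Optional["SearchState"] = None


def make_state_0(original_tree: Any,
                 evaluate: Callable[[Any], float]) -> SearchState:
    """State 0: untransformed tree, original evaluation, empty sub-page list."""
    return SearchState(evaluate(original_tree), original_tree)


def make_child(parent: SearchState, new_tree: Any, removed: List[Any],
               evaluate: Callable[[Any], float]) -> SearchState:
    """Child state: the new tree plus sub-pages holding the removed content."""
    return SearchState(evaluate(new_tree), new_tree,
                       parent.sub_pages + removed, parent)
```

Building the child's sub-page list as a new list leaves the parent state unchanged, so several children can be derived from the same parent.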

[0091] Fig. 9 graphically illustrates various states stored in the search space memory portion 634. In particular, Fig.
9 shows a document comprising a section header, a text paragraph, and an image. As shown in Fig. 9, in the initial
state, i.e., state 0, the original document has not been transformed. This initial state also shows the original rating,
i.e., the evaluation value, generated for the original document. Fig. 9 also shows the state 1 generated from the state
0 by applying the "elide all images" transform to the document of state 0. As shown in state 1, the re-authored
sub-page of state 1 contains the section header and the text but does not contain the image. Rather, in place of the
image, the re-authored sub-page of state 1 contains a link labeled "IMG" that links the re-authored page of state 1 to
the sub-page storing the image elided from the re-authored sub-page of state 1. State 1 also indicates the evaluation
value for this re-authored document. As shown in Fig. 9, the size requirements for the re-authored page are now
one-quarter of the size requirements of the original, un-re-authored page.

[0092] Fig. 9 also indicates that two additional states, state 2 and state 3, were generated by applying other transforms
to the document of state 0. Finally, Fig. 9 shows three additional states, state 4, state 5 and state 6, generated by
applying additional transforms to the re-authored document of state 1 or to the sub-page of state 1. For example, if the
sub-page containing the image is still too large to be displayed on the limited display area device 510, an intermediate
sub-page can be generated by applying the "reduce image by 25%", the "reduce image by 50%", or the "reduce image by
75%" transform to the image to obtain a re-authored document good enough to be displayed on the limited display
area device 510.

[0093] In operation, the document re-authoring system 600 of Fig. 8 receives the returned original document
over the communication link 580. The received original document is input through the input/output interface
620 and is stored in the original page memory 631 under the control of the controller 610. Then, the abstract syntax
tree generating circuit 640, under control of the controller 610, inputs the original document from the original page
memory portion 631 and generates an abstract syntax tree from that original document. The abstract syntax tree
generated by the abstract syntax tree generating circuit 640 is then stored in the abstract syntax tree memory portion
633 of the memory 630 under control of the controller 610.
[0094] The document size evaluation circuit 650 then inputs, under control of the controller 610, the abstract syntax
tree corresponding to the original document stored in the original page memory 631 and the various parameters from
the display device size memory 632 about the particular limited display area device 510 to which the re-authored
documents are to be returned. The document size evaluation circuit 650 then generates an evaluation value and stores
that evaluation value in state 0 of the search space memory portion 634. The document size evaluation circuit 650
also outputs an indication to the controller 610 whether the document of state 0 is good enough for outputting it to the
limited display area device 510 over one of the communication links 522 or 560. If the original document is already
good enough, the original document is immediately returned without further transformation.

[0095] Then, the transform circuit 660, under control of the controller 610, inputs the document of state 0, as
represented by the abstract syntax tree for that state, and applies one of the transforms stored in the transform memory
635 to the abstract syntax tree of the input state. In particular, the transform circuit 660 first determines, for the current
state i, whether the selected transform should be applied to the current state i of the document. For example, as
described above, if the current state i of the document does not contain any images, there is no point in applying any
of the image reduction or elision transforms to this state of the document. Furthermore, if the "elide all but first image"
transform has already been applied to obtain the current state i of the document, there is no point in applying the
"elide all but first and last images" transform to this current state i.
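
The applicability test described above can be sketched as a small predicate. The transform names follow the list of fifteen techniques given earlier; mapping "elide all but first image" to FirstImageOnly and "elide all but first and last images" to BookendImages is an assumption, as are the names IMAGE_TRANSFORMS, BLOCKED_BY and is_applicable.

```python
from typing import Dict, Iterable, Set

# Transforms that are pointless on a document without images.
IMAGE_TRANSFORMS: Set[str] = {
    "ReduceImages25%", "ReduceImages50%", "ReduceImages75%",
    "ElideAllImages", "FirstImageOnly", "BookendImages",
}

# name -> transforms whose prior application rules this one out.
BLOCKED_BY: Dict[str, Set[str]] = {
    "BookendImages": {"FirstImageOnly"},
    "FullOutline": {"FirstSentenceElision"},
    "FirstSentenceElision": {"FullOutline"},
}


def is_applicable(name: str, applied: Iterable[str], image_count: int) -> bool:
    """Decide whether transform `name` should be tried on the current state."""
    if name in IMAGE_TRANSFORMS and image_count == 0:
        return False
    return not (BLOCKED_BY.get(name, set()) & set(applied))
```

Only the two exclusions stated in the description are encoded here; a full implementation would carry a rule table covering all transforms.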

[0096] Assuming the current transform selected by the transform circuit 660 is properly applicable to the current state
i of the document, as indicated by the transformed abstract syntax tree for the current state i, the transform circuit 660
applies that transform to the abstract syntax tree for that state to generate a child state j. The child state j includes the
further transformed abstract syntax tree and a sub-page list indicating the sub-pages that remain to be transformed
based on the content elided from the original document necessary to reach this child state j. Finally, the document size
evaluation circuit 650, under control of the controller 610, evaluates the document obtained in the child state j to
determine if that resulting document is good enough for outputting to the limited display area device 510. That
evaluation value is then stored in the newly-created child state j.

[0097] After the transform circuit 660 has generated the new child state j, the transformed abstract syntax tree for that
state j is output to the document size evaluation circuit 650 for evaluating the size requirements of the document
corresponding to the state j.

[0098] Once the abstract syntax tree for the first page of the transformed document is determined to be good enough,
that abstract syntax tree is output to the tree-to-document remap circuit 670, which renders the first re-authored
sub-page from that abstract syntax tree. That first re-authored sub-page is output from the tree-to-document remap
circuit 670 to the input/output interface 620 and ultimately is transmitted to the limited display area device 510. At the
same time, the transform circuit 660 continues to apply additional transforms to any sub-pages resulting from
transforming the original document into the first good-enough re-authored sub-page. As each such sub-page is
transformed into a good-enough sub-page, the abstract syntax tree for each such good-enough sub-page is stored in the
re-authored page cache 636 until a request for that sub-page is received by the document re-authoring system 600 from
the limited display area device 510.

[0099] Once a request for that sub-page is received by the document re-authoring system 600, the abstract syntax
tree for that requested sub-page is output to the tree-to-document remap circuit 670, which renders the requested
re-authored sub-page from that abstract syntax tree. That requested re-authored sub-page is output from the
tree-to-document remap circuit 670 to the input/output interface 620 and ultimately is transmitted to the limited display
area device 510.

[0100] It should be understood that each of the circuits and other elements shown in Figs. 6-8 can be implemented
as portions of suitably programmed general purpose computers. Alternatively, each of the circuits shown in Figs. 6-8
can be implemented as physically distinct hardware circuits within one or more ASICs, or using FPGAs, PLDs, PLAs,
or PALs, or using discrete logic elements or discrete circuit elements. The particular form each of the circuits shown
in Figs. 6-8 will take is a design choice and will be obvious and predictable to those of ordinary skill in the art.

[0101] It should also be appreciated that the links 522, 560 and 580 can be any known or later-developed device or
system for connecting the limited display area device 510 to the host node 570 or the host node 570 to the transmitter/
receiver communication system 550 or the remaining portions 590 of the distributed network. Thus, the links 522, 560
and 580 can each be implemented as a direct cable connection, a connection over a wide-area network or a
local-area network, a connection over an intranet, or a connection over the Internet. In general, the links 522, 560 and
580 can be any known or later-developed connection system or structure usable to connect the corresponding apparatus
to the host node 570 over the distributed network.

[0102] It should further be appreciated that the document re-authoring system 600 is preferably implemented on a
programmed general purpose computer. However, the document re-authoring system 600 can also be implemented
on a special purpose computer, a programmed microprocessor or microcontroller and peripheral integrated circuit
elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as
a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA or PAL, or the like. In general, any
device capable of implementing a finite state machine that is in turn capable of implementing the flowcharts shown in
Figs. 11A-14 can be used to implement the document re-authoring system 600.

[0103] The memory 630 shown in Fig. 8 is preferably implemented using static or dynamic RAM. However, the
memory 630 can also be implemented using a floppy disk and disk drive, a writeable optical disk and disk drive, a hard
drive, flash memory or any other known or later-developed volatile or non-volatile alterable memory. In addition, the
memory 630 can further include one or more portions storing control programs for the controller 610. In general, such
control programs are preferably stored using non-volatile memory, such as flash memory, a ROM, a PROM, an
EPROM or EEPROM, a CD-ROM and disk drive, or any other known or later-developed alterable or non-alterable
non-volatile memory.

[0104] Fig. 10 shows another exemplary original document and the abstract syntax tree that is generated from that
document. As shown in Fig. 10, the document includes an image, a table having two rows of three columns each, and
a text paragraph. The resulting abstract syntax tree generated from this page includes a root node labeled "Page".
Three intermediate nodes, "Image", "Table" and "Paragraph", corresponding to each of the image, the table and the
text paragraph, respectively, extend from the root "Page" node. Furthermore, as shown in Fig. 10, two intermediate
nodes, "Row 1" and "Row 2", corresponding to each of the two rows, respectively, extend from the intermediate "Table"
node. Finally, three nodes, corresponding to each of the three cells in each row, respectively, extend from each of the
"Row 1" and "Row 2" nodes.

[0105] To re-author the page shown in Fig. 10, for example, the first transform to be applied would generally replace
the full-size image with a node representing an image reduced by 25%. Then, a new abstract syntax tree having a root
node corresponding to the full-sized image would be formed and linked by a hypertext link to the reduced image node
of the transformed abstract syntax tree. If the re-authored page having the image reduced by 25% is not yet good
enough, the image reduction transformations reducing the image by 50%, by 75% and then completely removing the image
would be applied in turn to the original document until a good-enough page was obtained. In each case, the abstract
syntax tree would contain a link from the transformed node corresponding to the image to the separate abstract syntax
tree containing the full-sized image. If removing the image completely is still insufficient to result in a good-enough
re-authored document, the table transform can be applied, as described above, to transform the table into a set of linked
individual cells, or the First Sentence Elision transform can be applied to move the text paragraph into a separate
subpage.
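
The tree described for Fig. 10 can be pictured with a short Python sketch. This is only an illustration under assumptions: the `Node` class, the helper names and the cell labels ("Cell 1.1" and so on) are hypothetical and do not come from the specification, which only names the "Page", "Image", "Table", "Paragraph", "Row 1" and "Row 2" nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a simplified abstract syntax tree (illustrative only)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child


def build_fig10_tree() -> Node:
    """Build the tree described for Fig. 10: image, 2x3 table, paragraph."""
    page = Node("Page")
    page.add(Node("Image"))
    table = page.add(Node("Table"))
    for r in (1, 2):
        row = table.add(Node(f"Row {r}"))
        for c in (1, 2, 3):
            row.add(Node(f"Cell {r}.{c}"))  # cell labels are assumed
    page.add(Node("Paragraph"))
    return page


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = build_fig10_tree()
print("\n".join(outline(tree)))
```

Keeping a `parent` reference on each node is a design choice made here so that transforms and navigation can move both down and up the tree.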

[0106] Figs. 11A and 11B are a flowchart outlining one exemplary method for re-authoring a page according to this
invention. As shown in Fig. 11, control begins in step S100 and continues to step S110, where a user connects a device
having a limited display area to a re-authoring system according to this invention. Then, in step S120, the re-authoring
system transmits one or more parameter forms to the user to obtain the necessary information about the limited display
area necessary to be able to re-author a requested page for display on the limited display area device. Then, in step
S130, the re-authoring system inputs the parameter information from the user and stores the input parameter information
in a memory. Control then continues to step S140.

[0107] As indicated above with respect to Figs. 6 and 7, the parameter information gathering process outlined in 
steps S120 and S130 can be automated so that the user does not have to be actively involved in performing steps
S120 and S130. Alternatively, as shown in optional step S135, steps S120 and S130 can be replaced by step S135. In
step S135, the user either actively inputs, or the limited display area device automatically outputs, an identification 
code to the re-authoring system identifying previously-stored parameter information for this particular limited display 
area device. Control then again continues to step S140. 
[0108] In step S140, a request for a document on the distributed network is output to the re-authoring system from
the user using the limited display area device. Then, in step S150, the re-authoring system obtains the requested
document from the distributed network. Next, in step S160, the obtained document is parsed to build an abstract syntax
tree of that document. Then, in step S170, an evaluation value for the obtained original document is generated from 
the abstract syntax tree. Control then continues to step S180. 
[0109] In step S180, the evaluation value is analyzed to determine if the obtained document is good enough to be
displayed on the limited display area device without any re-authoring. If so, control jumps to step S340. Otherwise, 
control continues to step S190. 

[0110] In step S190, one or more pre-re-authoring transforms are applied to the abstract syntax tree of the obtained,
original document. These pre-re-authoring transforms are used, for example, to remove portions of the original document
that do not contain any content but that consume display area. For example, such portions of the obtained document
include banners and other graphical elements that are merely identifying links to other pages or portions of the
page. These contentless images are replaced by text links. However, because such transforms do not actually remove
any content from the image, re-authoring the page in this way does not require the removed portions to be retained.
Other portions that can be removed without affecting the content of the original document include formatting commands
that add whitespace and other contentless esthetic formatting to the original document. Finally, other transforms can
be applied that convert the various fonts of a document to a single standard font to eliminate unnecessary display area
requirements of large and complicated fonts.

[0111] Once the pre-re-authoring transforms are applied in step S190, control continues to step S200, where an
evaluation value for the pre-re-authored original document is generated. Then, in step S210, the pre-re-authored
document's evaluation value is checked to determine if the pre-re-authored document is good enough to be displayed on
the limited display area device. If so, control again jumps to step S340. Otherwise, control continues to step S220.
[0112] In step S220, state 0 of the search space, corresponding to the pre-re-authored document, is selected as the
current state of the search space. Then, in step S230, a first transform is selected as the current transform. Then, in
step S240, a determination is made whether the current transform can be applied to the abstract syntax tree of the
current state. As outlined above, various ones of the transforms have conditions that indicate whether that transform
can be efficiently applied to the current re-authored document or whether the current transform is properly combinable
with previously applied transforms. If the current re-authored document corresponding to the current state is such that
the current transform can be efficiently applied and does not conflict with any previously applied transforms, control
continues to step S250. Otherwise, control jumps to step S290.

[0113] In step S250, the current state is transformed to a child state using the current transform, and the resulting
child state, including the transformed abstract syntax tree and any resulting sub-pages, is added to the search space.
Then, in step S260, an evaluation value is generated for the document corresponding to the transformed abstract
syntax tree corresponding to the child state generated in step S250. Next, in step S270, the evaluation value is analyzed
to determine if the document corresponding to the child state generated in step S250 is good enough to be displayed
on the limited display area device. If the evaluation value indicates the re-authored document or sub-page is good
enough, control jumps to step S310. Otherwise, control continues to step S280.

[0114] In step S280, a determination is made whether all transforms have been applied to the current state. If not all
of the transforms have been applied, control continues to step S290. Otherwise, control jumps to step S300.

[0115] In step S290, the next transform is selected as the current transform and control jumps back to step S240. In 
contrast, in step S300, the state of the search space having the best evaluation value is selected as the current state. 
Control then jumps back to step S230. 

[0116] In step S310, the document or sub-page defined by the current state is added to the re-authored page cache 
as a first re-authored page or a next re-authored sub-page suitable for delivery to the requesting limited display area
device. Then, in step S320, a determination is made whether there are any sub-pages resulting from the good-enough 
sub-page that has been added to the re-authored page cache. If there are any such sub-pages that still need to be re- 
authored, control continues to step S330. Otherwise, control jumps to step S340. 

[0117] In step S330, a state of the search space corresponding to one of the sub-pages to be re-authored is selected 
as the current state. Control then jumps back to step S230. In contrast, since there are no further sub-pages that need
to be re-authored, in step S340, the first re-authored page is output to the requesting limited display area device. Then, 
in step S350, the control routine ends. 
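
The loop of steps S220 through S300 can be read as a search over document states. The Python sketch below is one hedged reading of that flow, not the specification's implementation. Everything in it is an assumption made for illustration: a document is modelled as a tuple of block sizes, the evaluation value is simply the total size (lower is better), the two transforms are toys, step S300 is interpreted as "continue from the best state that has not yet been expanded", and a `max_states` cap is added so the search always terminates. Sub-page handling (steps S310 to S330) is omitted.

```python
from typing import Callable, List, Optional, Tuple

Doc = Tuple[int, ...]                        # stand-in document: block sizes
Transform = Callable[[Doc], Optional[Doc]]   # returns None if not applicable


def evaluate(doc: Doc) -> int:
    """Toy evaluation value: total display area required (lower is better)."""
    return sum(doc)


def halve_largest(doc: Doc) -> Optional[Doc]:
    """Toy 'image reduction': halve the largest block if it is big enough."""
    if not doc or max(doc) < 2:
        return None
    i = doc.index(max(doc))
    return doc[:i] + (doc[i] // 2,) + doc[i + 1:]


def drop_last(doc: Doc) -> Optional[Doc]:
    """Toy 'elision': drop the last block (a real system would link to it)."""
    return doc[:-1] if len(doc) > 1 else None


def reauthor(doc: Doc, transforms: List[Transform], limit: int,
             max_states: int = 1000) -> Doc:
    """Search for a state whose evaluation value is within `limit`."""
    if evaluate(doc) <= limit:               # S180/S210: already good enough
        return doc
    space = [doc]                            # search space, state 0 first
    expanded = set()
    current = doc                            # S220
    while len(space) < max_states:
        expanded.add(current)
        for transform in transforms:         # S230/S290: try each transform
            child = transform(current)       # S240/S250
            if child is None or child in space:
                continue
            space.append(child)
            if evaluate(child) <= limit:     # S260/S270
                return child
        # S300 (assumed reading): continue from the best unexpanded state.
        frontier = [s for s in space if s not in expanded]
        if not frontier:
            break
        current = min(frontier, key=evaluate)
    return min(space, key=evaluate)          # best effort if nothing qualifies


result = reauthor((40, 30, 20), [halve_largest, drop_last], limit=35)
print(result, evaluate(result))
```

Returning the best state found when no state meets the limit mirrors the fallback recited in the claims, where the transformed document that most closely meets the criterion is selected.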

[0118] Fig. 12 outlines one exemplary embodiment of an elision transform according to this invention. As shown in
Fig. 12, the elision transform routine begins in step S400, and continues to step S410, where a portion of a current
page or sub-page to be removed is selected. Then, in step S420, the selected portion is copied into a new sub-page.
Next, in step S430, an identifier is generated for the selected portion. In general, the identifier will be generated using
some content of the selected portion. For example, if the selected portion is a paragraph or other text string, the identifier
will be the first sentence or the first portion of the first sentence of the selected text portion. If the selected portion is
an image, the identifier could be a portion of text used to identify the image in the web document. Control then continues
to step S440.

[0119] In step S440, a link is generated to link the current page or sub-page with the generated sub-page. Then, in step
S450, the selected portion is removed from the current page or sub-page and the identifier and the link are added to
the current page. Next, in step S460, the control routine stops.
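
The elision steps of Fig. 12 can be sketched in a few lines of Python. This is a minimal illustration under assumptions: a page is modelled as a list of text blocks, sub-pages live in a dictionary, and the sub-page names (`sub1`, `sub2`, ...) and the anchor-style link format are hypothetical. Only text portions are handled; the image case is left out.

```python
import re
from typing import Dict, List

Page = List[str]     # a page is modelled as a list of text blocks


def first_sentence(text: str) -> str:
    """Identifier for a text block: its first sentence (step S430)."""
    match = re.match(r".*?[.!?](?=\s|$)", text, flags=re.S)
    return match.group(0) if match else text


def elide(page: Page, index: int, subpages: Dict[str, Page]) -> Page:
    """Move block `index` to a new sub-page and leave a linked identifier."""
    block = page[index]                              # S410: select portion
    name = f"sub{len(subpages) + 1}"                 # hypothetical name
    subpages[name] = [block]                         # S420: copy to sub-page
    identifier = first_sentence(block)               # S430: build identifier
    link = f'<a href="#{name}">{identifier}</a>'     # S440: generate link
    return page[:index] + [link] + page[index + 1:]  # S450: replace portion


page = ["Intro.",
        "HTML provides some degree of device independence. It is not enough."]
subpages: Dict[str, Page] = {}
print(elide(page, 1, subpages))
```

The function returns a new page rather than editing in place, so the untransformed state remains available to the search described earlier.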

[0120] Fig. 13 outlines one exemplary embodiment of a table transform according to this invention. As shown in Fig.
13, the table transform begins in step S500 and continues to step S505, where a top level table is selected as the
current table. Then, in step S510, the current table is checked to determine if there are any nested tables in the current
table. If so, control continues to step S515. Otherwise, control jumps to step S520. In step S515, one nested table of
the current table is selected as the new current table. Control then jumps back to step S510, to determine if there are
nested tables in this nested table selected as the current table.

[0121] Once there are no nested tables in the current table, in step S520, the current table is checked to determine
if there are any sidebars in the current table. If so, control continues to step S525. Otherwise, control jumps to step 
S535. In step S525, a link list is generated from all of the links in all of the sidebars of the current table. Then, in step 
S530, the link list is placed at the end of the current table. Control then continues to step S535. 
[0122] In step S535, the current table is divided into two or more portions. In particular, as indicated above, one
method for dividing the current table into portions is to divide each cell of the table into a separate portion. Then, in
step S540, each portion of the current table is copied into a separate new sub-page, and "Next" and "Previous" links 
are added to each such sub-page. Next, in step S545, the current table is replaced with the set of linked sub-pages 
generated in step S540. Control then continues to step S550. 

[0123] In step S550, the current table is checked to determine if it is the top level table. If not, there is at least one 
higher level table that still needs to be divided into portions. Accordingly, control continues to step S555. Otherwise,
control jumps to step S560. 

[0124] In step S555, the table that contains the current table is selected as the new current table. Control then jumps
back to step S510, to determine if there are any more nested tables in the current table. In contrast, in step S560, the
control routine ends. 
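
The table transform of Fig. 13 handles nested tables from the innermost level outward, which maps naturally onto recursion. The sketch below is an illustration under assumptions rather than the flowchart itself: the `Table` class is hypothetical, each cell becomes one portion, the sub-pages of a nested table are simply spliced into the parent's sequence, and "Next" and "Previous" links are represented as list indices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Table:
    """Hypothetical table model: cells are text or nested tables."""
    cells: List[Union[str, "Table"]]
    sidebar_links: List[str] = field(default_factory=list)


def split_table(table: Table) -> List[Dict[str, Optional[object]]]:
    """Divide a table into linked sub-pages, nested tables first."""
    portions: List[str] = []
    for cell in table.cells:
        if isinstance(cell, Table):
            # Nested tables are divided before their parent (S510/S515);
            # splicing their sub-pages into the parent is an assumption.
            portions.extend(str(p["content"]) for p in split_table(cell))
        else:
            portions.append(cell)
    if table.sidebar_links:
        # S520-S530: collect sidebar links into a list at the end.
        portions.append("Links: " + ", ".join(table.sidebar_links))
    pages = []
    for i, content in enumerate(portions):
        # S535/S540: one sub-page per portion, with Previous/Next links.
        pages.append({
            "content": content,
            "previous": i - 1 if i > 0 else None,
            "next": i + 1 if i < len(portions) - 1 else None,
        })
    return pages


inner = Table(["b1", "b2"])
outer = Table(["a", inner, "c"], sidebar_links=["Home", "News"])
for sub_page in split_table(outer):
    print(sub_page)
```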

[0125] Fig. 14 is a flowchart outlining one exemplary embodiment of an image reduction transformation according
to this invention. Beginning in step S600, the image reduction transformation continues to step S610, where the image
to be reduced in the current sub-page is selected. Then, in step S620, the reduced image is generated based on the
reduction factor associated with the particular image reduction transformation being applied. Then, in step S630, the
current sub-page is analyzed to determine if the selected image has been previously reduced. If so, control jumps to
step S670. Otherwise, control continues to step S640.

[0126] In step S640, the selected image is copied to a new sub-page. Next, in step S650, a link to the new sub-page
is generated. Then, in step S660, the full-size image is removed from the current page or sub-page, and the reduced
image and the generated link are added to the current page to form the re-authored page. Control then jumps to step
S680.

[0127] In contrast, in step S670, rather than the full-size image, the previously reduced image
is removed from the current sub-page and the new reduced image is added to the current sub-page.
However, because the current sub-page should already have a link to the previously-created sub-page containing the
full-sized image, it is not necessary to again add the link to the current sub-page or to create a new sub-page storing
that full-sized image. Control then continues to step S680, where the control routine ends.
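
The two branches of Fig. 14 differ only in whether a sub-page holding the full-size image already exists. The Python sketch below illustrates that branch under assumptions: the `Image` class, the `-full` sub-page naming and the pixel sizes are hypothetical, and `factor` is taken to be the fraction removed (0.25 for a 25% reduction), always measured against the full-size image.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Image:
    """Hypothetical image record used only for this illustration."""
    name: str
    size: Tuple[int, int]                 # (width, height) in pixels
    full_size_link: Optional[str] = None  # set once the image was reduced


def reduce_image(image: Image, factor: float,
                 subpages: Dict[str, Image]) -> Image:
    """Return a reduced image that links to the stored full-size image."""
    if image.full_size_link is None:
        # S640-S660: first reduction, so store the full-size image once.
        link = f"{image.name}-full"       # hypothetical sub-page name
        subpages[link] = image
    else:
        # S670: already reduced, so reuse the existing sub-page and link.
        link = image.full_size_link
    original = subpages[link]
    scale = 1.0 - factor
    new_size = (max(1, round(original.size[0] * scale)),
                max(1, round(original.size[1] * scale)))
    return Image(image.name, new_size, full_size_link=link)


subpages: Dict[str, Image] = {}
logo = Image("logo", (400, 200))
smaller = reduce_image(logo, 0.25, subpages)
smallest = reduce_image(smaller, 0.50, subpages)
print(smaller.size, smallest.size, list(subpages))
```

Scaling from the stored original rather than from the already reduced copy avoids compounding the reductions, which matches the statement in paragraph [0105] that the successive reductions are applied to the original document.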

[0128] Even with perfect automatic re-authoring of documents, there is often simply too much information in a typical
web document to make serendipitous cellular phone web browsing a pleasurable or profitable pastime, due to the
very small text-only-type display used in cellular phones. Typically, these devices and services will be used to find and
obtain information the user is specifically looking for. That is, these devices and services will be used for targeted
search and extraction. The document filtering systems and methods of this invention allow users to extract
the portions of documents they are interested in, via a simple, end-user scripting language that combines structural
page navigation commands with regular expression pattern matching and report generation functions.
[0129] The SPHINX system, as described in R. Miller et al., "SPHINX: a framework for creating personal, site-specific
Web crawlers", Seventh International World-Wide Web Conference, Brisbane, Australia, April 1998, provides a visual
tool that lets users create custom "personal" web crawlers that are similar in functionality to the filtering mechanism of
the systems and methods of this invention. The Internet Scrapbook, as described in A. Sugiura et al., "Internet
Scrapbook: automating Web browsing tasks by programming-by-demonstration", Seventh International World-Wide Web
Conference, Brisbane, Australia, April 1998, allows users to visually select elements from web pages and then updates
these elements in a "scrapbook" when the web pages change, providing a function that is similar to the page element
[...] of the systems and methods of this invention. Several commercial products also provide
[...] http://www.headliner.com, and OnDisplay's [...],
http://www.ondisplay.com, both provide visual editors that let users specify which structural parts
of web pages to extract. However, neither of these systems provides users with any ability to extract content based on [...]

[...] from a document based on commands written by a user in a high-level scripting language. The document filtering
systems and methods of this invention combine page structure navigation, regular expression matching, site traversal
or web crawling, and iterative matching, in addition to re-authoring of the extracted information using the document
re-authoring systems and methods of this invention described above.
[0131] A filter script is simply entered into a text file and saved on a web server. The [...]
requests [...] Uniform Resource Locator. A filter script will typically load a target web page, traverse to particular
locations within that web page, which are described structurally and/or by regular expressions, extract the content [...]
formatted before being returned to the user.

[0132] The document filtering systems and methods of this invention take advantage of the parse tree construction and
navigation of the document re-authoring systems and methods of this invention, by providing a simple set of HTML
document navigation options that use the concept of a "current context" in the HTML document. The current context
is analogous to a "cursor" in database programming, in that it refers to a location within the HTML document.

[0133] In actuality, the current context refers to a node in the HTML parse tree. The navigation commands serve to
shift this reference around within the tree until a desired part of the HTML document is found, at which time the desired
part can be extracted. For example, Fig. 10 shows an HTML document and its corresponding parse tree. When the
document is first loaded, by executing a "GO URL" command, the current context is pointing at the root node of the
parse tree, which essentially refers to the entire document.
[0134] Fig. 15 shows one exemplary embodiment of the document re-authoring system 600 further including a filter
circuit 690 that implements the document filtering systems and methods outlined herein. In particular, the filter circuit
690, under control of the controller 610, inputs a requested filter, requested by the user over one of the communication
links 522 or 560, that is supplied from a node of the distributed network storing such a filter over the communication
link 580. The filter circuit 690 then inputs the requested document from the node of the distributed network storing the
requested document and filters the requested document to extract the requested page elements. The filter circuit 690
stores these extracted page elements in the original page memory 631 in place of the original document initially stored
there. The document re-authoring system 600 then operates on these extracted page elements as if they were the
original document.

[0135] In extracting the page elements from the original document, the filter circuit 690 uses the abstract syntax tree
generated by the abstract syntax tree generating circuit from the original document and stored in the abstract syntax
tree memory.

[0136] Fig. 16 outlines one exemplary embodiment of the information flow when the requested document is also to
be filtered. As shown in Fig. 16, after a request for a filter is output by the limited display area device 510 to the HTTP
proxy server 571, the request for the filter is forwarded by the HTTP proxy server 571 to a remote node 592 of the
distributed network that stores the requested filter. The remote node 592 storing the requested filter returns the
requested filter to the document filter 690. The document filter 690 then requests, under control of the controller 610, the
document from the remote node 591 of the distributed network that stores the requested page. The remote node 591
storing the requested page returns the document to the document filter 690. The document filter 690 then filters the
returned document using the filter returned from the remote node 592 and the abstract syntax tree generated by the
abstract syntax tree generating circuit 640. The document filter 690 returns the extracted page elements to the document
re-authoring system 600, where the extracted page elements are treated as an original document for re-authoring
as described above.

[0137] There are three types of page navigation commands: those which go into the current context to select more
specific content, those which go out from the current context to enclosing structures, and those which traverse the 
page sequentially from the start of the current context, for example, to navigate to the next structure of some kind, 
which may or may not be properly contained within the current context. 
[0138] The simplest type of navigation command goes into the current context. For example, given the document
and current context shown in Fig. 10, executing the command "GO ROW 2" results in the current context being moved 
to the second table row object within the current context, as shown in Fig. 17. 

[0139] The current context can also be enlarged, i.e., moved up the parse tree towards the root node, by using a
"GO ENCLOSING" command. For example, given the document and context shown in Fig. 17, a "GO ENCLOSING
TABLE" command results in the current context shown in Fig. 18.

[0140] Finally, the current context can be moved forwards or backwards among the objects in a page in a sequential
manner, as they appear to a user. This is accomplished by moving the current context forwards or backwards from its
current location within a prefix traversal of the parse tree. This results in a search that first is performed within the
current context, then continues with the objects that follow the current context on the page. For example, a "GO
PREVIOUS IMAGE" command moves to the previous image found sequentially from the current context.
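
The three kinds of navigation command can be sketched as operations on a parse tree whose nodes keep a reference to their parent. The Python below is an illustration under assumptions, not the filter language itself: the `Node` class, the function names and the upper-case kind labels are hypothetical, the "prefix traversal" is implemented as a pre-order walk, and a failed navigation is reported with `LookupError`.

```python
from typing import List, Optional


class Node:
    """Minimal parse-tree node with a parent link (illustrative only)."""

    def __init__(self, kind: str, parent: Optional["Node"] = None):
        self.kind = kind
        self.parent = parent
        self.children: List["Node"] = []
        if parent is not None:
            parent.children.append(self)


def preorder(node: Node) -> List[Node]:
    """Nodes in pre-order, i.e. the order in which they appear on the page."""
    out = [node]
    for child in node.children:
        out.extend(preorder(child))
    return out


def go_into(context: Node, kind: str, n: int = 1) -> Node:
    """Like "GO ROW 2": the n-th node of `kind` inside the context."""
    matches = [x for x in preorder(context)[1:] if x.kind == kind]
    if len(matches) < n:
        raise LookupError(f"no {kind} number {n} in current context")
    return matches[n - 1]


def go_enclosing(context: Node, kind: str) -> Node:
    """Like "GO ENCLOSING TABLE": nearest ancestor of the given kind."""
    node = context.parent
    while node is not None and node.kind != kind:
        node = node.parent
    if node is None:
        raise LookupError(f"no enclosing {kind}")
    return node


def go_sequential(context: Node, kind: str, forward: bool = True) -> Node:
    """Like "GO NEXT ..." or "GO PREVIOUS ...": walk the page in order."""
    root = context
    while root.parent is not None:
        root = root.parent
    order = preorder(root)
    i = order.index(context)
    candidates = order[i + 1:] if forward else list(reversed(order[:i]))
    for node in candidates:
        if node.kind == kind:
            return node
    raise LookupError(f"no {'next' if forward else 'previous'} {kind}")


# The page of Fig. 10: an image, a table of two rows of three cells, a paragraph.
page = Node("PAGE")
image = Node("IMAGE", page)
table = Node("TABLE", page)
rows = [Node("ROW", table), Node("ROW", table)]
for row in rows:
    for _ in range(3):
        Node("CELL", row)
paragraph = Node("PARAGRAPH", page)

context = go_into(page, "ROW", 2)            # "GO ROW 2"
context = go_enclosing(context, "TABLE")     # "GO ENCLOSING TABLE"
print(context.kind)
```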

[0141] In addition to named page elements, navigation commands can also be specified using regular expressions.
For example, a "GO NEXT "DOW\sJONES\s*(\d+)\s*POINTS"" command moves the current context to the next match
of the specified regular expression, using a prefix traversal of text blocks on the page. The filtering systems and methods 
of this invention are able to demarcate sub-expressions and recall them into output strings. 

[0142] The simple navigation commands described above can also be used to navigate among a set of linked web
pages through the use of the "LINKEDPAGE" page object type. For example, a "GO FIRST LINKEDPAGE" command
moves to the first hypertext link within the current context, loads the referenced page and moves the current context
to the root of that document's parse tree, while a "GO ENCLOSING LINKEDPAGE" command returns the current 
context to the hypertext link that led to the document currently being processed. 

[0143] Traversal between pages is handled by a stack of script activations, each of which pairs script state information
(including current context) with a particular Uniform Resource Locator and a parse tree. This facilitates rapid navigation 
back and forth among linked pages and is required to support the "GO ENCLOSING LINKEDPAGE" command. 
[0144] Once the current context has been moved to a page object that is of interest, a "REPORT" command is used
to extract it. The "REPORT" command can be issued several times within a filter script, in which case the extracted
page elements are concatenated. The "REPORT" command can also be used to insert arbitrary strings into the output,
which can contain sub-strings from regular expression pattern matching. For example, the "REPORT "Dow:\1""
command adds the string "Dow: " plus a substring identified by the identifier "1" extracted during a regular expression
match to the filter's output.
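
The interplay of a regular-expression "GO NEXT" and a "REPORT" with a recalled sub-expression can be illustrated with Python's standard `re` module. This is a sketch under assumptions: the `Filter` class and its method names are hypothetical, the page is flattened to a list of text blocks in page order, and the sample text and index value are invented for the example.

```python
import re
from typing import List


class Filter:
    """Toy filter state: text blocks, a current position and pending output."""

    def __init__(self, blocks: List[str]):
        self.blocks = blocks     # text blocks in page (pre-order) order
        self.position = -1       # index of the current context block
        self.match = None        # most recent regular expression match
        self.output: List[str] = []

    def go_next(self, pattern: str) -> None:
        """Move the current context to the next block matching `pattern`."""
        regex = re.compile(pattern)
        for i in range(self.position + 1, len(self.blocks)):
            found = regex.search(self.blocks[i])
            if found:
                self.position, self.match = i, found
                return
        raise LookupError(f"no match for {pattern!r}")

    def report(self, template: str) -> None:
        """Append `template` to the output, expanding \\1-style references."""
        if self.match is not None:
            self.output.append(self.match.expand(template))
        else:
            self.output.append(template)


# Sample text and the value 9025 are made up for illustration.
f = Filter(["Top stories", "DOW JONES 9025 POINTS at the close"])
f.go_next(r"DOW\sJONES\s*(\d+)\s*POINTS")
f.report(r"Dow: \1")
print("".join(f.output))
```

`Match.expand` performs the same backslash substitution as `re.sub`, which is why the `\1` in the template is replaced by the first captured group.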

[0145] Often the user does not know in advance how many page elements of a particular kind will exist on a web
page. For example, the number of news article paragraphs in a daily e-zine will generally not be known in advance.
The "FOREACH" command addresses this lack of information by executing a sequence of commands for every page
element found within the current context that meets a specified criterion. When used with a "LINKEDPAGE" target, this
provides the functionality of a web spider that can visit all of the linked pages within a web site. In the following examples
the ellipses represent sequences of valid filter commands:
[0146] A "FOREACH PARAGRAPH DO ... END" command moves to each paragraph within the current context in turn
and executes the specified commands.

[0147] A "FOREACH LINKEDPAGE DO ... END" command loads each page that is reachable through hypertext links from the
current page in turn and executes the specified commands.

[0148] Whenever a filter encounters any kind of error, including navigation failures, regular expression matching
failures, or web page retrieval errors, it simply begins the next iteration of the innermost "FOREACH" loop in which the
offending command is embedded. If the error occurred at the top level of a filter, the filter halts execution and produces
any pending output.
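
The error rule for "FOREACH" loops maps closely onto exception handling. The following Python sketch is an illustration under assumptions: `FilterError`, `foreach` and `run_filter` are hypothetical names, a loop body is modelled as a function that appends to an output list, and every kind of filter failure is represented by one exception type.

```python
from typing import Callable, Iterable, List


class FilterError(Exception):
    """Navigation, pattern-matching or page-retrieval failure (illustrative)."""


def foreach(elements: Iterable[str],
            body: Callable[[str, List[str]], None],
            output: List[str]) -> None:
    """Run `body` on each element; an error skips to the next iteration."""
    for element in elements:
        try:
            body(element, output)
        except FilterError:
            continue   # begin the next iteration of this innermost loop


def run_filter(script: Callable[[List[str]], None]) -> List[str]:
    """Run a whole filter; a top-level error halts it and keeps the output."""
    output: List[str] = []
    try:
        script(output)
    except FilterError:
        pass           # halt execution and produce any pending output
    return output


def body(paragraph: str, output: List[str]) -> None:
    if "broken" in paragraph:
        raise FilterError("navigation failed")
    output.append(paragraph.upper())


def script(output: List[str]) -> None:
    output.append("start")
    foreach(["a", "broken one", "b"], body, output)
    raise FilterError("failure at the top level")


print(run_filter(script))
```

Because each `foreach` call catches the exception itself, an error inside a nested loop only ends the current iteration of that inner loop, which is the behavior the paragraph describes.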

[0149] The document re-authoring systems and methods of this invention do a good job of automatically re-authoring
documents for display on devices with small screens. One exemplary embodiment of the document re-authoring systems
and methods of this invention has been informally tested on a wide range of pages for a number of [...]
This exemplary embodiment of the document re-authoring systems and methods of this invention produced output that [...]

[0150] In a first exemplary embodiment, the document re-authoring systems and methods of this invention simply add
up the space requirements of all images and [...]
This is adequate for fairly dense documents with minimal structure, such as those in a Xerox Annual [...]
poorly for documents with a lot of whitespace or which use advanced layout techniques, such as, for example, tables.
In a second exemplary embodiment, the document re-authoring systems and methods of this invention include a size
estimator that performs much of the work performed by a browser in formatting each document version onto a display
area. Factors other than required screen area may also need to be included, such as actual width requirements of the
re-authored document, because users don't like to scroll horizontally, bandwidth requirements, [...]
[0151] Users should be able to adjust the various heuristics used in the document re-authoring systems and methods
of this invention to suit their taste. For example, the user could specify the relative preference of the transformation
techniques, or specify that some transforms not be used at all. At a higher level of abstraction, the user could express
preferences for a set of trade-offs, such as 'more content' vs. 'larger representation'. In addition, the re-authoring
systems and methods of this invention could be moved to the client and coupled with the browser so that the user
could dynamically apply and undo different transformations until the user achieves a result the user likes.

[0152] The automatic document re-authoring systems and methods of this invention, and in particular, the exemplary
embodiment of the HTTP proxy server described above, are preferably implemented on a programmed general purpose
computer. However, the automatic document re-authoring systems and methods of this invention, and, in particular,
the HTTP proxy server described above, can also be implemented on a special purpose computer, a programmed
microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a
digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic
device such as a PLD, PLA, FPGA or PAL, or the like. In general, any device capable of implementing a finite state
machine can be used to implement the automatic document re-authoring system and method of this invention, and in
particular, the HTTP proxy server described above.

[0153] The automatic document re-authoring systems and methods according to this invention can be performed by
invoking a stand-alone re-authoring program running on the HTTP proxy server described above, or can be performed
through a plug-in to a conventional web browser, such as Netscape Navigator or the like.

[0154] Furthermore, while the automatic document re-authoring systems and methods of this invention have been
described in relation to re-authoring documents obtained from the world-wide web, the automatic re-authoring systems
and methods of this invention can be used to re-author documents obtained from any distributed network, such as a
local area network, a wide area network, an intranet, the Internet, or any other distributed processing and storage
network.
[0155] While this invention has been described in conjunction with the specific embodiments outlined above, it is
evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, the
preferred embodiments of the invention set forth above are intended to be illustrative, not limiting. Various changes
may be made without departing from the spirit and scope of the invention.

Claims

1. A method for automatically re-authoring a document, comprising:
parsing the document;

transforming the parsed document into a transformed document;
generating an evaluation value from the transformed document;
determining if the evaluation value meets at least one evaluation criterion;

if the evaluation value for the transformed document does not meet the at least one criterion, repeating the
transforming, generating and determining steps using a different transform; and
if the evaluation value for the transformed document meets the at least one criterion, outputting the transformed
document.

2. The method of claim 1, wherein:

parsing the document comprises generating an abstract syntax tree from the document; and

transforming the parsed document comprises transforming the abstract syntax tree into at least one transformed
abstract syntax tree.

3. The method of claim 1 or claim 2, wherein transforming the parsed document comprises:
selecting a transform; 

determining if the transform can properly be applied to the parsed document; 

if the transform can properly be applied, transforming the parsed document into the transformed document 
using the selected transform; and 

if the transform cannot properly be applied, repeating the selecting and determining steps for a different
transform.

4. The method of any of claims 1-3, wherein transforming the parsed document into the transformed document
comprises at least one of outlining sections of the document, removing contentless portions from the document,
removing content from the document, reducing a size of at least one image within the document, removing at least
one image from the document, removing at least one table cell from the document, and summarizing text within
the document.

5. The method of claim 4, wherein:

outlining sections of the document preferably comprises:
identifying sections within the document,

identifying a section header and a document portion for each section,
placing each identified document portion into a separate subpage,

removing the identified document portions from the parsed document to form a transformed document
containing only the identified section headers,

converting each of the identified section headers into a link to the corresponding subpage, and 
linking the separate subpages together and to the transformed document; 

reducing a size of at least one image within the document preferably comprises: 

identifying at least one image within the document,
placing each identified image into a separate subpage, 
generating a reduced version of each identified image, 

removing each identified image from the document and inserting the reduced version of each removed 
image to form the transformed document, and 

adding, for each removed image, a link into the reduced version of that image to the subpage containing 
that removed image; 

removing at least one image from the document preferably comprises one of removing all images from the 
document, removing all but the first image from the document, and removing all but the first and last images 
from the document; 

removing at least one table cell from the document preferably comprises:
determining if the table contains any sidebars of links, 

if the table contains any sidebars, converting the sidebars into a list of links as a last cell of the table, 
identifying all but the first cell of the table, 
adding each identified cell to a separate subpage, 

replacing the table with the first cell to form the transformed document, and 
linking the separate subpages together and to the transformed document, and 
removing at least one table cell from the document preferably further comprises: 

determining if that cell is a nested table, 

if that cell is not a nested table, adding that cell to the separate subpage, and 

if that cell is a nested table, repeating the determining, converting, identifying, adding, replacing and
linking steps; and

removing contentless portions from the document preferably comprises at least one of replacing sequences
of page breaks or paragraph breaks with a single page break or paragraph break, removing indenting
from the document, converting text strings of the document to at least one of a single font and font size,
removing bullets from the document, removing background space from the document and removing banner
images from the document.
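
By way of illustration only, the section-outlining steps recited in this claim might be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of the invention: the document is modeled as a flat list of `(kind, text)` blocks, the subpage names `sub1.html`, `sub2.html` and the main page name `index.html` are hypothetical, and keeping any content that precedes the first header on the main page is a choice made for the example.

```python
from html import escape
from typing import Dict, List, Tuple

Block = Tuple[str, str]  # (kind, text); kind is "header" or "body" (hypothetical model)


def outline(blocks: List[Block]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Build a main page of header links plus one subpage per section."""
    main_page: List[str] = []
    subpages: Dict[str, List[str]] = {}
    current = None
    for kind, text in blocks:
        if kind == "header":
            # Start a new section: its portion goes to a separate subpage and
            # its header becomes a link on the main page.
            current = f"sub{len(subpages) + 1}.html"
            subpages[current] = [text]
            main_page.append(f'<a href="{current}">{escape(text)}</a>')
        elif current is None:
            main_page.append(text)  # assumption: content before any header stays
        else:
            subpages[current].append(text)
    # Link the subpages together and back to the transformed document.
    names = list(subpages)
    for i, name in enumerate(names):
        if i + 1 < len(names):
            subpages[name].append(f'<a href="{names[i + 1]}">next</a>')
        subpages[name].append('<a href="index.html">contents</a>')
    return main_page, subpages


blocks = [
    ("header", "Section 1"),
    ("body", "HTML provides some degree of device independence."),
    ("header", "Section 2"),
    ("body", "There are two approaches."),
]
main_page, subpages = outline(blocks)
```

After the call, `main_page` holds only the two section headers as links, and each section body has moved to its own subpage.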

6. The method of any of claims 1-5, wherein, if no transform results in a transformed document that has an evaluation
value that meets the at least one evaluation criterion, the method further comprises:

selecting the transformed document having the evaluation value that most closely meets the evaluation value; and

repeating the transforming, generating and determining steps on the selected transformed document using
an additional transform.
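
By way of illustration only, the fallback recited in this claim resembles a greedy search that carries the closest candidate into the next round. The following is a minimal sketch and not the implementation of the invention. It assumes that the evaluation value is a number where lower is better, that the criterion is `value <= limit`, and that a document is plain text; `reauthor`, `drop_tail`, `halve` and `max_rounds` are hypothetical names.

```python
from typing import Callable, List

Doc = str  # hypothetical: a document is plain text
Transform = Callable[[Doc], Doc]


def reauthor(
    doc: Doc,
    transforms: List[Transform],
    evaluate: Callable[[Doc], float],
    limit: float,
    max_rounds: int = 10,
) -> Doc:
    """Each round, try every transform; keep the closest candidate if none passes."""
    current = doc
    for _ in range(max_rounds):
        if evaluate(current) <= limit or not transforms:
            return current
        candidates = [t(current) for t in transforms]
        passing = [c for c in candidates if evaluate(c) <= limit]
        if passing:
            return passing[0]  # assumption: transform order encodes preference
        best = min(candidates, key=evaluate)  # closest to meeting the criterion
        if evaluate(best) >= evaluate(current):
            return current  # no transform makes progress, so stop
        current = best  # repeat on the selected transformed document
    return current


def drop_tail(d: Doc) -> Doc:
    return d[:-10]  # hypothetical transform: remove the last ten characters


def halve(d: Doc) -> Doc:
    return d[: len(d) // 2]  # hypothetical transform: keep the first half


page = reauthor("a" * 100, [drop_tail, halve], evaluate=len, limit=30)
```

In this example no candidate fits in the first round, so the halved document is selected and transformed again in the second round.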



7. The method of any of claims 1-6, wherein:

transforming the parsed document into a transformed document comprises generating at least one subpage; 

when a transformed document meets the at least one evaluation criterion, the method further comprises: 

generating an evaluation value for each generated subpage for that transformed document;

determining, for each subpage, if the evaluation value for that subpage meets the at least one evaluation
criterion;

for each subpage, if the evaluation value for that subpage does not meet the at least one criterion, performing
the transforming, generating and determining steps on that subpage using an additional one of the transforms
to generate a transformed subpage; and

for each subpage, if that subpage meets the at least one criterion, identifying that subpage as ready to be 
output. 
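
By way of illustration only, the per-subpage processing recited in this claim might be sketched as a recursion over generated subpages. This is a minimal sketch and not the implementation of the invention: the `Page` class, the line-count criterion `fits`, and the `split_page` transform, which moves overflow lines into a new subpage, are hypothetical stand-ins for the evaluation and transforms described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    lines: List[str]
    subpages: List["Page"] = field(default_factory=list)


def fits(page: Page, max_lines: int) -> bool:
    """Hypothetical evaluation criterion: the page has at most max_lines lines."""
    return len(page.lines) <= max_lines


def split_page(page: Page, max_lines: int) -> Page:
    """Hypothetical transform: keep the first lines, move the rest to a subpage."""
    head, rest = page.lines[:max_lines], page.lines[max_lines:]
    return Page(head, page.subpages + [Page(rest)])


def finalize(page: Page, max_lines: int) -> List[Page]:
    """Return the pages that are ready to be output, in reading order."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    if not fits(page, max_lines):
        page = split_page(page, max_lines)
    ready = [Page(page.lines)]  # this page now meets the criterion
    for sub in page.subpages:
        # Each generated subpage is evaluated and, if needed, transformed again.
        ready.extend(finalize(sub, max_lines))
    return ready


pages = finalize(Page([f"line {i}" for i in range(7)]), max_lines=3)
```

In this example a seven-line page is split twice, so that every resulting page meets the three-line criterion.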

8. The method of claim 1, further comprising, after parsing the document:

optionally removing contentless portions from the document to form a pre-transformed document;
generating an evaluation value from the document or the pre-transformed document; 
determining if the evaluation value meets at least one evaluation criterion; 

if the evaluation value for the document or the pre-transformed document does not meet the at least one
criterion, performing the transforming, generating and determining steps using a first one of the transforms, and
if the evaluation value for the document or the pre-transformed document meets the at least one criterion,
outputting the document or the pre-transformed document without removing any content from the document.
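
By way of illustration only, the preliminary check recited in this claim might be sketched as follows. This is a minimal sketch and not the implementation of the invention: the document is treated as plain text, the contentless portions removed are limited to indentation and repeated blank lines, and the evaluation value is a hypothetical estimate of the number of display lines at a given screen width.

```python
import re
from typing import Tuple


def remove_contentless(text: str) -> str:
    """Strip indentation and collapse runs of blank lines into one paragraph break."""
    lines = [ln.strip() for ln in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip("\n")


def evaluation_value(text: str, chars_per_line: int) -> int:
    """Hypothetical metric: display lines needed after wrapping each source line."""
    # -(-n // d) is ceiling division; an empty line still occupies one display line.
    return sum(max(1, -(-len(ln) // chars_per_line)) for ln in text.split("\n"))


def precheck(text: str, chars_per_line: int, max_lines: int) -> Tuple[str, bool]:
    """Return the pre-transformed document and whether it can be output as is."""
    pre = remove_contentless(text)
    return pre, evaluation_value(pre, chars_per_line) <= max_lines


raw = "  Title\n\n\n\n  Body text here\n\n\n"
pre, ready = precheck(raw, chars_per_line=10, max_lines=5)
```

When `ready` is true the pre-transformed document is output without removing any content; otherwise the first of the transforms would be tried.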

9. The method of any of claims 1-8, wherein transforming the document comprises:

filtering the document to extract desired portions of the document; and 
replacing the document with the extracted portions. 

10. A document re-authoring system that automatically re-authors a document, comprising:

a parse tree generating circuit that parses the document to generate a parse tree;

a transform circuit that transforms the parse tree using a first transform to generate a transformed parse tree
representing a transformed document, and that preferably transforms the parse tree or the transformed parse
tree using another transform to generate another transformed parse tree representing another transformed
document; and

an evaluation circuit that evaluates the parse tree or the another transformed parse tree to
determine if the document, the transformed document or the another transformed document meets at least
one evaluation criterion;

wherein, when the document, the transformed document or the another transformed document meets the at
least one evaluation criterion, the document, the transformed document or the another transformed document
is output to a display device.
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